-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add IntervalQuery and IntervalsSource to expose minimum interval semantics across term fields [LUCENE-8196] #9243
Comments
Adrien Grand (@jpountz) (migrated from JIRA) Thanks Alan. I agree that growing a separate hierarchy of objects might help land this feature. We might even want to put first iterations of this work in sandbox to give time for the API to stabilize before we move it to core or misc. I have some questions/comments:
|
Alan Woodward (@romseygeek) (migrated from JIRA)
Yes, terms and phrases return 1 instead than taking the overall width into account. This is so that they score the same as TermQuery and PhraseQuery
Unfortunately yes, or at least we need a way of resetting the iterator on each new document. It might be possible to avoid passing the doc down and having a return value though, I'll see what I can do.
This is a bit trickier, as it's not obvious where the wrapping would happen. +1 to everything else, I'll work on a follow-up. |
Jim Ferenczi (@jimczi) (migrated from JIRA)
public abstract class IntervalsSource {
protected final String field;
public IntervalsSource(String field) {
this.field = field;
}
public abstract IntervalIterator intervals(LeafReaderContext ctx) throws IOException;
... ... and then we can check in each implementation that the sources are all targeting the same field. Regarding the score of the intervals, it seems that the patch uses the inverse length of the interval rather than the slop within the interval like the sloppy phrase scorer. Could we compute the total slop of the current interval (as the sum of the slop of each interval source that composed this interval) and use its inverse to score each ? This would make different interval query more comparable in terms of score since an interval with few terms and a slop>0 would score less that one with more terms but no slop. I'll look deeper at the implementation of the different queries but I like the simplicity of the patch and the fact that there is a paper with a proof for each of them. |
Alan Woodward (@romseygeek) (migrated from JIRA) I opened a pull request at apache/lucene-solr#334 to make this easier to review. @jpountz I think I've addressed most of your feedback? @jimczi I'd rather keep the API as it is, with the field being passed to IntervalQuery and then recursing down the IntervalSource tree. Otherwise you end up having to declare the field on all the created sources, which seems redundant. I've removed the cross-field hack entirely for the moment. I'll see if I can improve the scoring next. |
Alan Woodward (@romseygeek) (migrated from JIRA) I discussed scoring with @jimczi and @jpountz offline, and we decided to just use the inverse length of intervals as a sloppy frequency for now, as described in the Vigna paper linked above. This means that we can't compare scores directly with existing phrase queries, but the query mechanism is quite different (particularly for SloppyPhraseScorer) so it makes sense that scores won't be the same either. |
Jim Ferenczi (@jimczi) (migrated from JIRA)
+1 to remove the cross-field hack, thanks. Regarding the API it's ok since IntervalQuery limits all sources to one field so I am fine with that (I misunderstood how the IntervalQuery can be used). |
Alan Woodward (@romseygeek) (migrated from JIRA) This patch makes IntervalIterator extend DocIdSetIterator, and makes the per-document reset() function protected and called automatically on nextDoc() and advance(target). |
Alan Woodward (@romseygeek) (migrated from JIRA) This patch moves everything into the sandbox for now, and adds a package-info explaining how to use things. I had to duplicate DisiWrapper, DisiPriorityQueue and DisjunctionDISIApproximation, but I don't think that's too much of a problem. Having things in sandbox should reduce confusion with Span queries, and give us time to try and switch things over. I think this is ready to be committed. |
Adrien Grand (@jpountz) (migrated from JIRA) Some comments:
Otherwise +1 |
David Smiley (@dsmiley) (migrated from JIRA)
|
Jim Ferenczi (@jimczi) (migrated from JIRA) +1 too, there are some places where you could initialize the current interval with[−∞ . . −∞] in order to avoid the nullity check. Most of the operators algorithm seem good, though I don't understand why you change the order of the disjunction ? If you don't start with the smallest right interval from the queue you could miss a lot of minimum intervals that could be needed if the disjunction is used inside another operator ? |
Alan Woodward (@romseygeek) (migrated from JIRA) I moved the approximation and reset() from IntervalIterator, and it turned out that ConjunctionIntervalIterator was a convenient place to put them, so I kept that in. I also cleaned up the utility classes (no need to check for TwoPhaseIterator or BitSetIterator when we know that the leaves are always PostingsEnum) and moved to org.apache.lucene.search.intervals. Payload matches can be added trivially at a later point. Payload scoring will be a more interesting one, but I'm not entirely happy with the way scoring works at the moment anyway, I need to read around a bit more on good ways of scoring proximity queries in general. Offsets may not be trivial to add, as we have the same problem we originally had with Spans here, in that some of the algorithms advance leaf intervals before returning. Something to consider later on, definitely. The PQ specialization is just lifted directly from DisjunctionDISIApproximation in the existing core search package, so it wasn't my conclusion :) re extractTerms() - I don't like this method being here in the first place really, see my comment about scoring above. I think let's keep it simple for now, you can always filter afterwards. @jimczi - where can I remove null checks? I think I'm constrained by having to return [-1..-1] when iterator has moved to a new doc but hasn't been advanced over intervals yet. The disjunction order is to address #8451. If a disjunction is appearing within a block, then sorting for minimum intervals can miss valid matches. The Vigna paper doesn't seem to discuss this case anywhere. One possible solution that would work in all cases would be to add a boolean to IntervalsSource.getIntervals() that indicates whether or not the source is in a final position - if it is, then it should return minimal intervals, otherwise it should return wider ones. This would only be applicable to disjunctions. |
Alan Woodward (@romseygeek) (migrated from JIRA) I talked over disjunction ordering with @jimczi, and we agreed to revert back to the ordering specified in the original paper. In future it might be worth contacting the authors and seeing if they've covered the case of prefix disjunctions elsewhere. I think this is ready? |
Jim Ferenczi (@jimczi) (migrated from JIRA) +1 |
ASF subversion and git services (migrated from JIRA) Commit 974c03a in lucene-solr's branch refs/heads/branch_7x from @romseygeek LUCENE-8196: Add IntervalQuery and IntervalsSource to the sandbox |
ASF subversion and git services (migrated from JIRA) Commit 00eab54 in lucene-solr's branch refs/heads/master from @romseygeek LUCENE-8196: Add IntervalQuery and IntervalsSource to the sandbox |
Alan Woodward (@romseygeek) (migrated from JIRA) Thanks all! |
ASF subversion and git services (migrated from JIRA) Commit b772b58 in lucene-solr's branch refs/heads/branch_7x from @romseygeek LUCENE-8196: Check that the TermIntervalsSource is positioned on the correct term |
ASF subversion and git services (migrated from JIRA) Commit 1d6502c in lucene-solr's branch refs/heads/master from @romseygeek LUCENE-8196: Check that the TermIntervalsSource is positioned on the correct term |
ASF subversion and git services (migrated from JIRA) Commit 7fcaac8 in lucene-solr's branch refs/heads/branch_7x from @romseygeek LUCENE-8196: Check for a null input in LowpassIntervalsSource |
ASF subversion and git services (migrated from JIRA) Commit 7117b68 in lucene-solr's branch refs/heads/master from @romseygeek LUCENE-8196: Check for a null input in LowpassIntervalsSource |
Matt Weber (@mattweber) (migrated from JIRA) @romseygeek This is great! How would we prevent matching at the same interval? In public void testUnorderedQueryNoSelfMatch() throws IOException {
Query q = new IntervalQuery(field, Intervals.maxwidth(2, Intervals.unordered(Intervals.term("w3"), Intervals.term("w3"))));
checkHits(q, new int[]{1});
} |
Alan Woodward (@romseygeek) (migrated from JIRA)
The original paper doesn't look like it addresses this. I'll try and work out the best way of dealing with things, I guess we'll need to keep track of the positions of internal intervals in the priority queue, and when we advance make sure that they don't collide. |
Jim Ferenczi (@jimczi) (migrated from JIRA) I don't think we should prevent anything ;). unordered is a conjunction operator so it should match if all terms match (which is the case in your example) so these results are expected IMO. Maybe we should rename unordered to and in order to avoid confusion ? Query q = new IntervalQuery(field, Intervals.maxwidth(2, Intervals.ordered(Intervals.term("w3"), Intervals.term("w3")))); @romseygeek while I was playing with unordered I realized that we don't protect against sources that match but don't have intervals. Query q = new IntervalQuery(query, Intervals.unordered(Intervals.term("w2"), Intervals.ordered(Intervals.term("w3"),Intervals.term("w3")))); does not work because the unordered query doesn't check if the sub source has intervals when it adds it in the queue. |
Alan Woodward (@romseygeek) (migrated from JIRA) Good catch @jimczi, I'll commit that change. I like the idea of changing unordered to and as well - I think that makes sense @mattweber? |
Matt Weber (@mattweber) (migrated from JIRA) @jimczi @romseygeek I think rename to |
Jim Ferenczi (@jimczi) (migrated from JIRA) I don't think an operator can prevent anything here, a query for Intervals.ordered(Intervals.term("w3"), Intervals.term("w3")) should always return all intervals of the term "w3" (it will not interleave successive intervals of "w3"). @mattweber why do you think that this "scenario" should be prevented ? When I do "foo AND foo" I don't expect it to match only document that have foo twice ? |
Matt Weber (@mattweber) (migrated from JIRA) I use these queries to build query parsers and I am specifically thinking of an unordered near and how I can prevent it from matching the same term. I can't think of any situation where a user would think |
Matt Weber (@mattweber) (migrated from JIRA) So given a single document with the value Intervals.unordered(Intervals.term("b"), Intervals.term("a")) Intervals.unordered(Intervals.term("b"), Intervals.term("b")) The first I think would have an interval width of |
Alan Woodward (@romseygeek) (migrated from JIRA) I think minwidth() would run into problems with documents that have two instances of 'b', because unordered will always find the minimal intervals, so it would always end up with intervals of width 0, which would then be rejected by the filter, and you'd end up with missing matches. What we really need here I think is a new source, something like 'unordered-non-overlapping', which checks that all of the internal intervals are separated. With a better name, of course :) . And we should rename 'unordered' to 'and' to make the semantics a bit clearer. |
ASF subversion and git services (migrated from JIRA) Commit 345fdff in lucene-solr's branch refs/heads/branch_7x from @romseygeek LUCENE-8196: Fix unordered case with non-matching subinterval |
ASF subversion and git services (migrated from JIRA) Commit 1a18acd in lucene-solr's branch refs/heads/master from @romseygeek LUCENE-8196: Fix unordered case with non-matching subinterval |
Alan Woodward (@romseygeek) (migrated from JIRA) I opened #9347 to deal with unordered overlaps. |
Martin Hermann (migrated from JIRA) First of all, I really like this implementation and the ideas that went into it. But as I have spent quite some time with the old span queries and their problems, I'd like to comment on some things and maybe offer some fresh view points for old problems: Intervals.maxwidth(3,
Intervals.ordered(
Intervals.or(
Intervals.maxwidth(2,
Intervals.ordered(Intervals.term("big"),Intervals.term("bad"))),
Intervals.term("evil")),
Intervals.term("wolf"))); which also matches "evil eyes wolf", which should not be a match. It would be possible to rewrite the query so that the disjunction is at the top level, something like Intervals.or(
Intervals.maxwidth(2,
Intervals.ordered(Intervals.term("evil"),Intervals.term("wolf"))),
Intervals.maxwidth(3,
Intervals.ordreed(Intervals.term("big"),Intervals.term("bad"),Intervals.term("wolf")))); which would work as expected, but I think we can agree that this is not really a nice solution (but I will come back to it later).
Now, we already know that "(big OR "big bad") wolf" would not match "big bad wolf" (this is exactly the genome editing thing), but I think it is worth to point out exactly why: It actually should not match, according to the definition of "minimum interval": Any match for "big bad" is also a match for big, so the first IntervalsSource only passes matches for "big", and then we get no match for "big wolf". This is a feature of the query semantic of the paper (and maybe the reason for the efficency and simplicity of the algorithms): The problems that spanQueries had are gone, because we define the unexpected behaviour to be correct*. As much as I like the IntervalQueries, I do not really think this is satisfactory.
does not match "big bad wolf", in contrast to
, which does. So we actually lose a match by adding a OR-clause, and I think we can agree that this is not really good. Of course these are not queries a human would write, but I think one major use case of span queries is some sort of automatic query generation, and that's where I think it is really important to meet at least some basic expectations (such as not losing matches by adding disjunctions). *A short theoretical aside: I think that most of the span query problems came from the fact that we want to have a "next match" function, i.e. some sort of ordering of matches, together with the nature of span query Matches, which are essentially a pair of numbers (start and end of match). This means we have to specify an order on pairs of numbers (which is possible, of course; the solution with span queries was a lexical order, i.e. the start always increases, and if it stays the same, the end increases). But I think it is not really possible to implement completly lazy behaviour with this ordering: Think of some ordered "((a OR b) followed by (c OR d)) with enough slop" and the document "a b c d" which should find "a b c d" before "b c" (as the start increases), but has to cache the match for "c", which (in the sub-query "(c OR d)) occurs before the one for "b". So the combination of ordering and lazy is where we get problems. The vigna paper is now quite clever, in that the very definition of "minimum interval" solves this problem, as the pair of start and end positions gets essentially reduced to a single number (e.g. start position), as it is not possible for one of them to increase with the other staying the same or even decreasing (as otherwise one of the matches would be contained in the other). I actually think that a possible solution would be to give up the "lazy" part instead and have some kind of caching, but that is not what this ticket is about. |
Alan Woodward (@romseygeek) (migrated from JIRA) Hi Martin Hermann Thanks for the detailed feedback - this is very helpful!
|
Martin Hermann (migrated from JIRA)
or adding the gap automatically somewhere down the way? I think in the first case it'd still be possible to construct some (maybe a little bit more complicated) examples that can't be solved like that and where the minimal intervals behaviour does not match intuition. Again, while a lot of these queries may seem quite exotic, I think that intervals will get used a lot various programmatically generated queries (as spans do now), and there pretty much anything can happen. |
Alan Woodward (@romseygeek) (migrated from JIRA) Martin Hermann seeing as this ticket is closed, I opened #9523 to continue discussion. |
This ticket proposes an alternative implementation of the SpanQuery family that uses minimum-interval semantics from http://vigna.di.unimi.it/ftp/papers/EfficientAlgorithmsMinimalIntervalSemantics.pdf to implement positional queries across term-based fields. Rather than using TermQueries to construct the interval operators, as in #3952 or the current Spans implementation, we instead use a new IntervalsSource object, which will produce IntervalIterators over a particular segment and field. These are constructed using various static helper methods, and can then be passed to a new IntervalQuery which will return documents that contain one or more intervals so defined.
Migrated from LUCENE-8196 by Alan Woodward (@romseygeek), resolved Apr 03 2018
Attachments: LUCENE-8196.patch (versions: 5), LUCENE-8196-debug.patch
Linked issues:
Pull requests: apache/lucene-solr#334
The text was updated successfully, but these errors were encountered: