Ability to group search results by field [LUCENE-1421] #2495
Comments
Martijn van Groningen (@martijnvg) (migrated from JIRA) This is an initial patch that allows result grouping with Lucene via a Collector, and an attempt to integrate result grouping into Lucene / Solr. The collector can be used just like any other collector and returns TopDocs. The TopDocs contains GroupDoc instances; GroupDoc is a subclass of ScoreDoc. I think this makes it easier to integrate grouping into existing code that uses Lucene (like Solr). I think that grouping code should be part of Lucene instead of Solr. I put the result grouping into a new contrib that I named grouping. Putting it in a contrib seemed the right place to me. The patch doesn't contain any Solr code and I think a new issue in Solr should be opened for that. This patch is 'inspired by' SOLR-236, but only contains its core functionality: nonadjacent grouping based on field value, with group counts. Also, in the code I don't use the verb collapsing but grouping. This patch is also faster than the Solr variants, because the grouping occurs whilst the documents are collected and thus saves multiple searches. Also the grouping algorithm itself is improved. Although this is work in progress, any thoughts about this would be appreciated. BTW the patch is based on the Lucene trunk and is relative to the trunk directory. |
Michael McCandless (@mikemccand) (migrated from JIRA)
+1 This is a very popular issue (currently tied for 2nd place in votes). Unfortunately, I think the single-pass collector attached here doesn't So I decided to factor out parts of Solr's current two-pass approach The downside of the two-pass approach is you run the query twice, But one nice side effect of the two-phased approach is that sharding |
Michael McCandless (@mikemccand) (migrated from JIRA) Initial rough patch. I think it's working well. I reused the test We cannot yet cutover Solr to this module because it doesn't support I plan to backport this to 3.x as contrib/grouping. |
Bill Bell (migrated from JIRA) The issue I have with the group=true feature in Solr is that the facets are not calculated post grouping. If we can get the facets to return counts POST grouping that would be ideal. Bill |
Martijn van Groningen (@martijnvg) (migrated from JIRA) Nice work Michael! I also think that the two-pass mechanism is definitely the preferred way to go. I think we also need a strategy mechanism (or at least a GroupCollector class hierarchy) inside this module. The mechanism should select the right group collector(s) for a certain request. Some users may only care about the top group document, so a second pass won't be necessary. Another example, with faceting in mind: when group-based faceting is necessary, the top N groups don't suffice. You'll need all group docs (I currently don't see another way). These group docs are then used to create a grouped Solr DocSet. But this should be a completely different implementation. |
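For readers following along, here is a rough sketch of the two-pass idea discussed above, in plain Java with no Lucene dependency. It is only an illustration under stated assumptions: `Hit`, `firstPass` and `secondPass` are hypothetical names rather than the module's API, groups are ranked by relevance score only, and iterating an in-memory list twice stands in for running the query twice.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TwoPassGroupingSketch {
    /** Hypothetical stand-in for a search hit; not a Lucene class. */
    static final class Hit {
        final int docId;
        final String group;
        final float score;

        Hit(int docId, String group, float score) {
            this.docId = docId;
            this.group = group;
            this.score = score;
        }
    }

    /** Pass 1: return the topNGroups group values, ranked by each group's best score. */
    static List<String> firstPass(List<Hit> hits, int topNGroups) {
        Map<String, Float> bestScore = new LinkedHashMap<>();
        for (Hit h : hits) {
            bestScore.merge(h.group, h.score, Float::max);
        }
        List<Map.Entry<String, Float>> ranked = new ArrayList<>(bestScore.entrySet());
        ranked.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(topNGroups, ranked.size()); i++) {
            top.add(ranked.get(i).getKey());
        }
        return top;
    }

    /** Pass 2: visit the hits again, keeping the best maxDocsPerGroup doc ids per selected group. */
    static Map<String, List<Integer>> secondPass(List<Hit> hits, List<String> groups, int maxDocsPerGroup) {
        Map<String, List<Hit>> perGroup = new LinkedHashMap<>();
        for (String g : groups) {
            perGroup.put(g, new ArrayList<>());
        }
        for (Hit h : hits) {
            List<Hit> docs = perGroup.get(h.group);
            if (docs != null) {
                docs.add(h); // hits whose group missed the top N are ignored
            }
        }
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Hit>> e : perGroup.entrySet()) {
            List<Hit> docs = e.getValue();
            docs.sort((a, b) -> Float.compare(b.score, a.score)); // best score first
            List<Integer> ids = new ArrayList<>();
            for (Hit h : docs.subList(0, Math.min(maxDocsPerGroup, docs.size()))) {
                ids.add(h.docId);
            }
            result.put(e.getKey(), ids);
        }
        return result;
    }
}
```

Calling `firstPass(hits, 2)` and feeding its result into `secondPass(hits, groups, 2)` yields the top two groups with up to two documents each. The second pass only needs memory for the selected groups, which is the trade-off against visiting the hits twice.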
Michael McCandless (@mikemccand) (migrated from JIRA) Patch w/ next iteration... I beefed up the overview.html, added test case coverage of "null" groupValue. I think it's ready to commit and then back-port to 3.x! |
Michael McCandless (@mikemccand) (migrated from JIRA)
I agree, there's much more we could do here! Specialized collection for the maxDocsPerGroup=1 case, and for the "I want all groups" case, would be nice. For the "not many unique values in the group field" case we could do a single-pass collector, I think. Grouping by a multi-valued field should be possible (we now have DocTermOrds in Lucene, but it doesn't load the term byte[] data), as well as support for sharding, ie, by merging top groups and docs w/in each group (but I think we need an addition to FieldComparator API for this). I think we should commit this starting point, today, and then iterate from there... Martijn, thank you for persisting for so long on SOLR-236! We are |
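To illustrate why a single-pass collector might be workable when the group field has few unique values, here is a hedged sketch in plain Java. It is not Lucene code: `Hit` and `group` are hypothetical names, and the approach (one bounded min-heap per group value) is just one way such a collector could be built. Memory grows with the number of unique group values, which is why it would only suit the low-cardinality case.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class SinglePassGroupingSketch {
    /** Hypothetical stand-in for a search hit; not a Lucene class. */
    static final class Hit {
        final int docId;
        final String group;
        final float score;

        Hit(int docId, String group, float score) {
            this.docId = docId;
            this.group = group;
            this.score = score;
        }
    }

    /**
     * Visits every hit once, keeping the best maxDocsPerGroup hits for EVERY group,
     * then returns the topNGroups groups ranked by their best score.
     * Assumes maxDocsPerGroup >= 1.
     */
    static Map<String, List<Integer>> group(List<Hit> hits, int topNGroups, int maxDocsPerGroup) {
        // One min-heap per group value; the lowest-scoring hit sits at the head.
        Map<String, PriorityQueue<Hit>> heaps = new LinkedHashMap<>();
        for (Hit h : hits) {
            PriorityQueue<Hit> heap = heaps.computeIfAbsent(h.group,
                    g -> new PriorityQueue<Hit>((a, b) -> Float.compare(a.score, b.score)));
            heap.add(h);
            if (heap.size() > maxDocsPerGroup) {
                heap.poll(); // evict the current lowest-scoring hit
            }
        }
        // Turn each heap into a best-first list, then rank groups by their best hit.
        List<List<Hit>> groups = new ArrayList<>();
        for (PriorityQueue<Hit> heap : heaps.values()) {
            List<Hit> docs = new ArrayList<>(heap);
            docs.sort((a, b) -> Float.compare(b.score, a.score));
            groups.add(docs);
        }
        groups.sort((a, b) -> Float.compare(b.get(0).score, a.get(0).score));
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (List<Hit> docs : groups.subList(0, Math.min(topNGroups, groups.size()))) {
            List<Integer> ids = new ArrayList<>();
            for (Hit h : docs) {
                ids.add(h.docId);
            }
            result.put(docs.get(0).group, ids);
        }
        return result;
    }
}
```

Compared with a two-pass design, this avoids a second visit to the hits at the cost of holding a heap for every group value seen, not just the top N.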
Michael McCandless (@mikemccand) (migrated from JIRA)
How would the field values for the group be defined...? Or would facets run on all not-collapsed docs...? |
Bill Bell (migrated from JIRA) Say we have 4 documents: docid=1 docid=2 docid=3 docid=4 If we group by hgid, we would get: hgid=1
hgid=3
hgid=4
If I set Facet Counts = POST, age: 10 (1 document).
If I set Facet Counts = PRE, age: 10 (2 documents).
The only way grouping works in Solr now is Facet Counts = PRE. Thanks. |
Michael McCandless (@mikemccand) (migrated from JIRA) But what if docid=2 had age=17 instead? How would we determine what value the group (for hgid=1) should have for the "age" field? Or... would the group count +1 to age=10 and +1 to age=17 in that case? (ie, as if the group were a single document w/ multi-valued field age). |
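The candidate semantics in this exchange can be compared side by side with a small sketch. Everything here is an assumption for illustration: the `Doc` class and method names are hypothetical, and the caller supplies the field values, since the thread does not settle them. "Representative" counting takes the first document seen per `hgid`, which presumes the input is already in group-sort order.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class GroupedFacetSketch {
    /** Hypothetical document with a group field (hgid) and a facet field (age). */
    static final class Doc {
        final int docId;
        final int hgid;
        final int age;

        Doc(int docId, int hgid, int age) {
            this.docId = docId;
            this.hgid = hgid;
            this.age = age;
        }
    }

    /** PRE: every matching document counts toward its age value. */
    static Map<Integer, Integer> preGrouping(List<Doc> docs) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (Doc d : docs) {
            counts.merge(d.age, 1, Integer::sum);
        }
        return counts;
    }

    /** POST, representative doc: only the first document seen per hgid counts. */
    static Map<Integer, Integer> postGroupingRepresentative(List<Doc> docs) {
        Set<Integer> seenGroups = new HashSet<>();
        Map<Integer, Integer> counts = new TreeMap<>();
        for (Doc d : docs) {
            if (seenGroups.add(d.hgid)) { // true only for the first doc of a group
                counts.merge(d.age, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** POST, multi-valued: each group counts once for every distinct age it contains. */
    static Map<Integer, Integer> postGroupingMultiValued(List<Doc> docs) {
        Map<Integer, Set<Integer>> agesPerGroup = new LinkedHashMap<>();
        for (Doc d : docs) {
            agesPerGroup.computeIfAbsent(d.hgid, k -> new HashSet<>()).add(d.age);
        }
        Map<Integer, Integer> counts = new TreeMap<>();
        for (Set<Integer> ages : agesPerGroup.values()) {
            for (int age : ages) {
                counts.merge(age, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

With two documents sharing `hgid=1` and `age=10`, `preGrouping` counts age 10 twice and both POST variants count it once. If the second of those documents had `age=17` instead, the representative variant would drop 17 entirely, while the multi-valued variant would add +1 to both 10 and 17.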
Martijn van Groningen (@martijnvg) (migrated from JIRA)
That would depend on the group sort, right? If your group sort is age asc the lowest document would be chosen.
You mean like a sum of all ages per group? That is interesting, but sounds more like a function to me. This can be computed with a separate group collector. It wouldn't make sense to me to have this with a regular field facet. |
Michael McCandless (@mikemccand) (migrated from JIRA)
OK, I see. So the group is "represented" by the doc within it that
Well, not sum, but multi-valued? (Ie, as if this group were This way, if the user then does a drill-down by a specific age, the I agree we need to hash out these semantics :) Bill could you open a separate Lucene issue, to work out the semantics |
Martijn van Groningen (@martijnvg) (migrated from JIRA) Michael I see you have committed it to the trunk. Nice work! As for porting this code to the 3x branch I see that this branch doesn't have modules. Does it mean that it will be a Lucene contrib? |
Michael McCandless (@mikemccand) (migrated from JIRA) Woops, it should not be protected! I'll fix... thanks Martijn! Yes, I ported to 3.x as contrib/grouping, so this will be released when we release 3.2. |
Robert Muir (@rmuir) (migrated from JIRA) Hi Martijn, in 3.x it is available as a lucene contrib (contrib/grouping). As far as classes being package-private, I don't think Mike intended this, as they are marked experimental. Want to upload a patch that ensures everything you need has correct visibility? |
Martijn van Groningen (@martijnvg) (migrated from JIRA)
Only the SearchGroup class had different visibility compared to the patch. And Michael says he is going to change that. So I think a patch for that is a bit overkill.
Cool! After an svn update I see the contrib now as well. The question is how to go on from here: continue development in this issue, or open a separate issue for each new grouping-related feature? I haven't seen an issue regarding post faceting yet. I can create one if necessary. |
Robert Muir (@rmuir) (migrated from JIRA)
Ok, thanks for taking a look... I just figured we could tackle any visibility issues at once, but it seems this is the only one.
Sure! Of course it will unfortunately need to be blocked on #4152, but it seems like a good idea to have the issue open for planning.
My opinion would be to open new issues for each new grouping feature! This way things can get committed faster and it's easier to These APIs are marked experimental so I don't think we should waste time on backwards compatibility, nor should we try to come up with |
Michael McCandless (@mikemccand) (migrated from JIRA) Resolving this... we can iterate in further issues. Thanks Martijn! |
Michael McCandless (@mikemccand) (migrated from JIRA) I added grouping queries to the nightly benchmarks Those queries are the same queries running as TermQuery, just with I use the CachingCollector. First off, I'm impressed that the perf hit for grouping is not too
I had expected we'd pay a bigger perf hit! Second, the more unique groups you have, the slower grouping gets, Remember, though, that these groups are randomly generated Third, and this is insanity, the addition of grouping caused other Similarly strange, when I added sorting (TermQuery sorting by title |
Martijn van Groningen (@martijnvg) (migrated from JIRA)
Nice! Are the regular sort and group sort different in these test cases? Do you think that when new features are added, they also need to be added to this test suite? Or is this performance test suite just for the basic features? |
Michael McCandless (@mikemccand) (migrated from JIRA) I'm only testing groupSort and sort by relevance now in the nightly bench. I'll add sort-by-title, groupSort-by-relevance cases too, so we test that. Hmm, though: this content set is alphabetized by title I believe, so it's not really a good test. (I suspect that's why the TermQuery sorting by title is faster
Well, in general I'd love to have wider coverage in the nightly perf test... really it's only a start now. But there's no hard rule we have to add new functions into the nightly bench... |
Robert Muir (@rmuir) (migrated from JIRA) Bulk closing for 3.2 |
It would be awesome to group search results by a specified field. Some functionality was provided for Apache Solr, but I think it should be done in core Lucene. It could also return useful information about the collapsed data, such as total hits, total count and so on.
Thanks,
Artyom
Migrated from LUCENE-1421 by Artyom Sokolov, 9 votes, resolved May 14 2011
Attachments: LUCENE-1421.patch (versions: 2), lucene-grouping.patch