Facet benchmarking [LUCENE-3262] #4335
Toke Eskildsen (@tokee) (migrated from JIRA) I've attached a second shot at faceting performance testing. It separates the taxonomy generation into a CorpusGenerator (maybe similar to the RandomTaxonomyWriter that Robert calls for in #4337?). Proper setup of faceting tweaks for the new faceting module is not done at all and not something I find myself qualified for. |
Doron Cohen (migrated from JIRA) I am working on a patch for this, much along the lines of the Solr benchmark patch in SOLR-2646.
Should have an initial patch in a day or two. |
Doron Cohen (migrated from JIRA) Patch (3x) with working facets indexing benchmark.
'ant run-task -Dtask.alg=conf/facets.alg' will run an algorithm that indexes facets. Not ready to commit yet - need some testing and docs. Also, only covers indexing for now, though perhaps search with facets can go in a separate issue. |
Shai Erera (@shaie) (migrated from JIRA) Patch looks good! I have a couple of initial comments:
Looks very good. Now with FacetSource we can generate facets per the case we want to test (dense hierarchies, Zipf'ian ...) |
Doron Cohen (migrated from JIRA) Thanks for reviewing Shai!
/**
* Set the taxonomy reader. Takes ownership of that taxonomy reader, that is,
* internally performs taxoReader.incRef().
* `@param` taxoReader The taxonomy reader to set.
*/
public synchronized void setTaxonomyReader(TaxonomyReader taxoReader) throws IOException {
}
/**
* Set the index reader. Takes ownership of that index reader, that is,
* internally performs indexReader.incRef().
* `@param` indexReader The indexReader to set.
*/
public synchronized void setIndexReader(IndexReader indexReader) throws IOException {
}
Thanks for the review, working on a new patch - there are several copy/paste errors in the code where a CloseTaxonomyReader by mistake sets the PFD IndexReader to null... |
Shai Erera (@shaie) (migrated from JIRA)
I don't have that warning turned on in Eclipse. I disabled it for exactly this reason :).
The new name is ok, and the properties fit it better. BTW, if you wanted the existing .algs out there to not fail silently, you could add some code to setConfig that checks for these outdated properties and throws a proper exception. But I'm ok with the solution you chose.
The javadocs are good. I'd also add "<b>NOTE:</b> if you no longer need that IndexReader/TaxoReader, you should decRef()/close() after calling this method". Otherwise, the IR/TR will just stay open ... |
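To illustrate the ownership contract described in the javadocs and the suggested note, here is a minimal sketch. It assumes a hypothetical `RefCountedReader` and `ReaderHolder` that only mimic incRef()/decRef() semantics; they are not the Lucene IndexReader/TaxonomyReader classes or the code in the patch.

```java
/** Hypothetical stand-in for a ref-counted reader (assumption, not a Lucene class). */
class RefCountedReader {
    private int refCount = 1; // the creator holds the first reference

    synchronized void incRef() {
        if (refCount <= 0) {
            throw new IllegalStateException("reader is already closed");
        }
        refCount++;
    }

    synchronized void decRef() {
        if (refCount <= 0) {
            throw new IllegalStateException("too many decRef() calls");
        }
        refCount--; // the reader counts as closed once this reaches 0
    }

    synchronized int getRefCount() {
        return refCount;
    }

    synchronized boolean isOpen() {
        return refCount > 0;
    }
}

/** Hypothetical holder mimicking the setIndexReader/setTaxonomyReader contract. */
class ReaderHolder {
    private RefCountedReader reader;

    /** Takes its own reference on the new reader and releases the old one. */
    synchronized void setReader(RefCountedReader newReader) {
        if (newReader == reader) {
            return; // nothing to change
        }
        if (newReader != null) {
            newReader.incRef();
        }
        if (reader != null) {
            reader.decRef();
        }
        reader = newReader;
    }

    synchronized RefCountedReader getReader() {
        return reader;
    }
}
```

In this sketch a caller that opens a reader, hands it to the holder, and no longer needs it must still call decRef() itself. Otherwise the count never drops to zero when the holder lets go, which is the "will just stay open" situation the note warns about.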
Doron Cohen (migrated from JIRA) Updated patch with a test, more javadocs, and a comment as Shai suggested. I think this is ready to commit. More tests are needed, and also Search with facets is missing, but that can go in a separate issue. |
Shai Erera (@shaie) (migrated from JIRA)
+1. Perhaps just add a CHANGES entry?
I think it's better if we resolve it in that issue, and maybe rename the issue to "Facet benchmarking framework". You can still commit the current progress because it is 'whole' - covering the indexing side. I've worked on issues before that had several commits, so this will not be the first one. We should also run some benchmark tests, describing clearly the data sets, but this can be done under a separate issue. |
Doron Cohen (migrated from JIRA)
Right, I always forget to include it in the patch and add it only afterwards; I should change that... Also, I am not comfortable with using a config property in AddDocTask to indicate that facets should be added. It seems too implicit to me, all of a sudden... So I think it would be better to refactor the doc creation in AddDoc into a method, and add an AddFacetedDocTask that extends AddDoc and overrides the creation of the doc to be added, calling super and then adding the facets to it. |
Doron Cohen (migrated from JIRA) Actually, since the doc is created at setup(), it is sufficient to make the doc protected (it was private). Also, the with.facets property is useful for comparisons, so I kept it (now used only in AddFacetedDocTask) but changed its default to true. |
Shai Erera (@shaie) (migrated from JIRA)
What do you mean? Someone can use AddFacetedDocTask w/ and w/o facets? What for? (sorry, but you didn't post a patch, so I cannot see what this is about) |
Doron Cohen (migrated from JIRA)
It is useful for specifying the property like this: with.facets=facets:true:false
...
{ "MAddDocs" AddFacetedDoc > : 400
and then getting in the report something like this:
|
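A rough sketch of how a value such as `facets:true:false` can be read per round. It assumes the benchmark convention that the first token is the report column name and the remaining tokens are cycled through on successive rounds; `RoundProperty` is a hypothetical helper written for illustration, not the benchmark Config class.

```java
import java.util.Arrays;

/** Hypothetical helper illustrating the name:value:value convention (assumption). */
class RoundProperty {
    final String columnName; // label shown in the report, or null for a plain value
    final String[] values;   // one value per round, reused cyclically

    RoundProperty(String raw) {
        String[] parts = raw.split(":");
        if (parts.length < 2) {
            // Plain single-valued property: same value in every round.
            columnName = null;
            values = new String[] { raw };
        } else {
            columnName = parts[0];
            values = Arrays.copyOfRange(parts, 1, parts.length);
        }
    }

    /** Returns the value in effect for the given zero-based round. */
    String valueForRound(int round) {
        return values[round % values.length];
    }
}
```

With `with.facets=facets:true:false`, even rounds would index with facets and odd rounds without, so the same algorithm can report both variants side by side under a "facets" column.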
Doron Cohen (migrated from JIRA) Updated patch according to Shai's comments and with AddFacetedDoc task. |
Shai Erera (@shaie) (migrated from JIRA) Ahh, forgot about iterations. It does indeed look useful then. Perhaps mention facet.source in AddFacetedDocTask? I'm +1 for committing the current progress, but keep this issue open for the search side (to complete the framework). |
Gilad Barkai (migrated from JIRA) Doron, great patch! I ran it and was somewhat surprised at the large overhead of the facet indexing. Digging deeper, I found the number of random facets to be 1-120 per document, with depth of 1-8. I believe those are overkill requirements. I reduced those to 1-20 per document with depth of 1-3 and got results I could live with. Also, I changed the alg to consume the entire content source. I would suggest renaming max.facet.length (in the alg) & maxFacetLengh (in the code) to max.facet.depth and maxFacetDepth. Depth seems more appropriate. Other than that - I'm thrilled to have a working benchmark with facets - thanks! |
Shai Erera (@shaie) (migrated from JIRA)
I think we should maybe have a Wiki page or something with several .alg files that test different scenarios. I don't think 1-120 is a scenario we should skip testing. Rather, we should describe, either in a Wiki or a JIRA issue, the performance results for each scenario. And if the results are suspicious for a particular scenario, dig deeper and understand why. So given that you know the numbers from above were run with that many facets per document, do the numbers make sense? Or do you still think they're high?
+1. |
Doron Cohen (migrated from JIRA)
I agree, tried this too now and the comparison is more reasonable. Changing the defaults to 20/3 and preparing to commit. |
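For readers who want a feel for what "up to 20 facets per document, depth up to 3" means, here is a small sketch of random facet generation. `RandomFacetSketch` and its parameters are hypothetical names chosen for illustration; this is not the facet source committed in the patch, and the label alphabet (digits 0-9) is an arbitrary assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical generator of random category paths (assumption, not the patch code). */
class RandomFacetSketch {
    private final Random random;
    private final int maxFacetsPerDoc;
    private final int maxFacetDepth;

    RandomFacetSketch(long seed, int maxFacetsPerDoc, int maxFacetDepth) {
        this.random = new Random(seed);
        this.maxFacetsPerDoc = maxFacetsPerDoc;
        this.maxFacetDepth = maxFacetDepth;
    }

    /** Returns between 1 and maxFacetsPerDoc paths, each 1 to maxFacetDepth components long. */
    List<List<String>> nextDocFacets() {
        int numFacets = 1 + random.nextInt(maxFacetsPerDoc);
        List<List<String>> facets = new ArrayList<>(numFacets);
        for (int i = 0; i < numFacets; i++) {
            int depth = 1 + random.nextInt(maxFacetDepth);
            List<String> path = new ArrayList<>(depth);
            for (int d = 0; d < depth; d++) {
                // Arbitrary label alphabet: one of ten labels per level.
                path.add(Integer.toString(random.nextInt(10)));
            }
            facets.add(path);
        }
        return facets;
    }
}
```

Constructing it as `new RandomFacetSketch(seed, 20, 3)` corresponds to the reduced defaults, while `(seed, 120, 8)` corresponds to the heavier scenario that showed the large indexing overhead.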
Doron Cohen (migrated from JIRA) Committed to 3x in r1180637, thanks Gilad! |
Doron Cohen (migrated from JIRA) After manually removing benchmark/{work,temp}, the Reuters collection was correctly extracted, and in trunk the alg runs the same as in 3x. |
A spin-off from #4152. We should define a few benchmarks for faceting scenarios, so we can evaluate the new faceting module as well as any improvement we'd like to consider in the future (such as cutting over to docvalues, implementing FST-based caches etc.).
Toke attached a preliminary test case to #4152, so I'll attach it here as a starting point.
We've also done some preliminary work on extending Benchmark for faceting, so I'll attach it here as well.
We should perhaps create a Wiki page where we clearly describe the benchmark scenarios, then include results of 'default settings' and 'optimized settings', or something like that.
Migrated from LUCENE-3262 by Shai Erera (@shaie), updated Oct 09 2011
Attachments: CorpusGenerator.java, LUCENE-3262.patch (versions: 3), TestPerformanceHack.java