Facet benchmarking [LUCENE-3262] #4335

Open · asfimport opened this issue Jun 30, 2011 · 19 comments

asfimport commented Jun 30, 2011

A spin off from #4152. We should define a few benchmarks for faceting scenarios, so we can evaluate the new faceting module as well as any improvements we'd like to consider in the future (such as cutting over to docvalues, implementing FST-based caches, etc.).

Toke attached a preliminary test case to #4152, so I'll attach it here as a starting point.

We've also done some preliminary work on extending Benchmark for faceting, so I'll attach it here as well.

We should perhaps create a Wiki page where we clearly describe the benchmark scenarios, then include results of 'default settings' and 'optimized settings', or something like that.


Migrated from LUCENE-3262 by Shai Erera (@shaie), updated Oct 09 2011
Attachments: CorpusGenerator.java, LUCENE-3262.patch (versions: 3), TestPerformanceHack.java

asfimport commented Jul 7, 2011

Toke Eskildsen (@tokee) (migrated from JIRA)

I've attached a second shot at faceting performance testing. It separates the taxonomy generation into a CorpusGenerator (maybe similar to the RandomTaxonomyWriter that Robert calls for in #4337?).

Proper setup of faceting tweaks for the new faceting module is not done at all and not something I find myself qualified for.

asfimport commented Oct 4, 2011

Doron Cohen (migrated from JIRA)

I am working on a patch for this, much along the lines of the Solr benchmark patch in SOLR-2646.
Currently the direction is:

  • Add to PerfRunData:
    • Taxonomy Directory
    • Taxonomy Writer
    • Taxonomy Reader
  • Add tasks for manipulating facets and taxonomies:
    • create/open/commit/close Taxonomy Index
    • open/close Taxonomy Reader
    • AddDoc with facets
  • FacetDocMaker will also build the categories into the document

  • FacetSource will bring back categories to be added to the current doc

  • ReadTask will be extended to also support faceted search.
    This is different from the Solr benchmark approach, where SolrSearchTask extends PerfTask rather than ReadTask.
    Not sure yet if this is the way to go - still work to be done here.
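
To make the new tasks concrete, here is a rough sketch of how one of them could plug into the existing benchmark framework. PerfTask and PerfRunData are the existing benchmark types; the taxonomy accessors on PerfRunData are the ones proposed in the outline above, so their names here are assumptions rather than final API:

import java.io.IOException;

import org.apache.lucene.benchmark.byTask.PerfRunData;
import org.apache.lucene.benchmark.byTask.tasks.PerfTask;
import org.apache.lucene.facet.taxonomy.TaxonomyReader;

/** Sketch: closes the taxonomy reader currently held by PerfRunData. */
public class CloseTaxonomyReaderTask extends PerfTask {

  public CloseTaxonomyReaderTask(PerfRunData runData) {
    super(runData);
  }

  @Override
  public int doLogic() throws IOException {
    TaxonomyReader taxoReader = getRunData().getTaxonomyReader();
    if (taxoReader != null) {
      // Clear PerfRunData's reference; assuming the setter releases (decRef()s)
      // the previously held reader, this is what actually closes it.
      getRunData().setTaxonomyReader(null);
    }
    return 1; // one unit of work, for the report
  }
}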

Should have a start patch in a day or two.

Doron Cohen (migrated from JIRA)

Patch (3x) with working facets indexing benchmark.
It follows the outline above, except that:

  • there is no FacetDocMaker - only FacetSource
  • there is no AddDocWithFacet - instead, AddDoc takes an additional config param: with.facet

'ant run-task -Dtask.alg=conf/facets.alg' will run an algorithm that indexes facets.

Not ready to commit yet - needs some testing and docs. Also, it only covers indexing for now; perhaps search with facets can go in a separate issue.

Shai Erera (@shaie) (migrated from JIRA)

Patch looks good! I have a couple of initial comments:

  • facets.alg: since these .alg files often serve as examples, I think it would be good if it declared facet.source (as random) explicitly.

  • OpenTaxonomyReaderTask: I see that since PerfRunData incRef()s the incoming taxonomy reader, you decRef() it. I also see that setIndexReader behaves the same way. But I find it confusing. Personally, since this is not an application, I don't think we should 'hold a reference to the IR/LTR just in case the one who set it closes it'. But if we do that, can we at least document on setIR/LTR that this is the case? I can certainly see myself opening an IR/LTR and setting it on PerfRunData without decRef()/close(). It would not occur to me that I should ...

  • The abstraction of ItemSource is nice. But its javadocs still refer to the content.source.* properties. Since we're not committed to backwards compatibility in benchmark, and in the interest of clarity, perhaps we should rename them to item.source.*?

  • ItemSource.resetInputs has a @SuppressWarnings("unused") – is it a leftover from when it was private?

  • In PerfRunData ctor you do a Class.forName using the String name of RandomFacetSource. Why not use RandomFacetSource.class.getName()?

Looks very good. Now with FacetSource we can generate facets per the case we want to test (dense hierarchies, Zipf'ian ...)

Doron Cohen (migrated from JIRA)

Thanks for reviewing, Shai!

  • facets.alg: I agree, and added that specific facet source setting. I also modified the algorithm to have several rounds, some with facets and some without, so the effect of adding facets on performance is measured.

  • PFD.readers.incRef() - once the task (e.g. OpenReaderTask) has opened the reader, that task is gone and the reader really is maintained by the PFD; therefore IMO it is important that the PFD maintains its ref count as it does today. I agree that more documentation for this can help, and I added something in this spirit to both setReader()s:

  /**
   * Set the taxonomy reader. Takes ownership of that taxonomy reader, that is,
   * internally performs taxoReader.incRef().
   * @param taxoReader The taxonomy reader to set.
   */
  public synchronized void setTaxonomyReader(TaxonomyReader taxoReader) throws IOException {
  }

  /**
   * Set the index reader. Takes ownership of that index reader, that is,
   * internally performs indexReader.incRef().
   * @param indexReader The indexReader to set.
   */
  public synchronized void setIndexReader(IndexReader indexReader) throws IOException {
  }

  • ItemSource and the "source.xyz" config props - I am afraid there might be algorithms out there that would silently stop working, since the property names are plain strings and a rename would not be caught. So I would rather not alter these property names. Instead, following your comment, I renamed ItemSource to ContentItemsSource, which is still extended by both ContentSource and FacetSource. So it is not that bad now that the property names start with "content." (these properties apply to all content-item sources, that is, both doc content and facets).

  • ItemSource.resetInputs @SuppressWarnings("unused") - this is for the IOException, which is not really thrown by that code (though it might be thrown by subclasses). Perhaps just a misconfiguration of my IDE? (I assume you have that warning disabled or something?)

  • PerfRunData Class.forName on a String rather than class.getName() - this is in fact the code pattern for all reflection actions in PerfRunData, so I just repeated for the new code whatever was there for the existing code. However, I did think about the same thing when adding this code. I think the difference is that if the config specifies another class, the default class would not even be loaded when using strings, but it would be loaded (though not used) when using RandomFacetSource.class.getName(). It is not a big performance issue, but it does seem "cleaner" to not load classes that are not going to be used. In any case, if we would like to change that, we should revisit all the .forName(String) calls on default impls in the benchmark (and I think there are a few), and that seems not to belong in this issue.
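
    For illustration, the difference boils down to when the default implementation's class gets loaded; a small sketch, assuming the benchmark Config.get(key, default) pattern and the facet.source property from this patch (config here stands for the Config instance held by PerfRunData):

    // String form: RandomFacetSource is loaded only if the config does not name
    // another class, since only the chosen class name ever reaches Class.forName.
    String srcClass = config.get("facet.source",
        "org.apache.lucene.benchmark.byTask.feeds.RandomFacetSource");
    FacetSource facetSource = (FacetSource) Class.forName(srcClass).newInstance();

    // Class-literal form: merely evaluating RandomFacetSource.class forces that
    // class to be loaded, even when the config points at a different implementation.
    String srcClass2 = config.get("facet.source", RandomFacetSource.class.getName());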

Thanks for the review, working on a new patch - there are several copy/paste errors in the code where a CloseTaxonomyReader by mistake sets the PFD IndexReader to null...

Shai Erera (@shaie) (migrated from JIRA)

ItemSource.resetInputs

I don't have that warning turned on in Eclipse. I disabled it for exactly this reason :).

ItemSource rename

The new name is ok, and the properties fit it better. BTW, if you wanted the .algs out there not to fail silently, you could add some code to setConfig that checks for these outdated properties and throws a proper exception. But I'm ok with the solution you chose.

PFD.readers.incRef()

The javadocs are good. I'd also add "<b>NOTE:</b> if you no longer need that IndexReader/TaxoReader, you should decRef()/close() after calling this method". Otherwise, the IR/TR will just stay open ...
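
For example, with that convention a task that opens a reader and registers it would go roughly like this (a sketch; setIndexReader is the PerfRunData setter being discussed):

IndexReader reader = IndexReader.open(getRunData().getDirectory());
// PerfRunData incRef()s the reader when it is set...
getRunData().setIndexReader(reader);
// ...so the task releases its own reference, leaving PerfRunData holding the only one.
reader.decRef();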

Doron Cohen (migrated from JIRA)

Updated patch with a test, more javadocs, and a comment as Shai suggested.

I think this is ready to commit.

More tests are needed, and also Search with facets is missing, but that can go in a separate issue.

Shai Erera (@shaie) (migrated from JIRA)

I think this is ready to commit.

+1. Perhaps just add a CHANGES entry?

but that can go in a separate issue

I think it's better if we resolve it in that issue, and maybe rename the issue to "Facet benchmarking framework". You can still commit the current progress because it is 'whole' - covering the indexing side. I've worked on issues before that had several commits, so this will not be the first one.

We should also run some benchmark tests, describing clearly the data sets, but this can be done under a separate issue.

Doron Cohen (migrated from JIRA)

changes entry

Right, I always forget to include it in the patch and add it only afterwards; I should change that...

Also, I am not comfortable with the use of a config property in AddDocTask to indicate that facets should be added. It seems too implicit to me, all of a sudden... So I think it would be better to refactor the doc creation in AddDoc into a method, and add an AddFacetedDocTask that extends AddDoc and overrides the creation of the doc to be added, calling super and then adding the facets to it.

Doron Cohen (migrated from JIRA)

Actually, since the doc is created at setup(), it is sufficient to make the doc field protected (it was private). Also, the with.facets property is useful for comparisons, so I kept it (now used only in AddFacetedDocTask) but changed its default to true.
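
A rough sketch of that shape (the field and helper behavior here are illustrative assumptions, not necessarily the committed code):

import org.apache.lucene.benchmark.byTask.PerfRunData;
import org.apache.lucene.benchmark.byTask.tasks.AddDocTask;

public class AddFacetedDocTask extends AddDocTask {

  private final boolean withFacets;

  public AddFacetedDocTask(PerfRunData runData) {
    super(runData);
    // with.facets (default true) makes it easy to compare rounds with and without facets
    withFacets = runData.getConfig().get("with.facets", true);
  }

  @Override
  public void setup() throws Exception {
    super.setup(); // AddDocTask prepares the document into its (now protected) field
    if (withFacets) {
      // pull this doc's categories from the FacetSource and add them to the
      // prepared document before AddDocTask.doLogic() indexes it
    }
  }
}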

Shai Erera (@shaie) (migrated from JIRA)

Also that with.facets property is useful for comparisons

What do you mean? Someone can use AddFacetedDocTask w/ and w/o facets? What for? (sorry, but you didn't post a patch, so I cannot see what this is about)

Doron Cohen (migrated from JIRA)

Someone can use AddFacetedDocTask w/ and w/o facets? What for?

It is useful for specifying the property like this:

with.facets=facets:true:false
...
{ "MAddDocs" AddFacetedDoc > : 400

and then getting in the report something like this:

------------> Report sum by Prefix (MAddDocs) and Round (4 about 4 out of 42)
Operation    round facets   runCnt   recsPerRun        rec/s  elapsedSec
MAddDocs_400     0   true        1          400       246.61        1.62
MAddDocs_400 -   1  false -  -   1 -  -  -  400 -   1,801.80 -  -   0.22
MAddDocs_400     2   true        1          400       412.80        0.97
MAddDocs_400 -   3  false -  -   1 -  -  -  400 -   2,139.04 -  -   0.19

Doron Cohen (migrated from JIRA)

Updated patch according to Shai's comments and with AddFacetedDoc task.

Shai Erera (@shaie) (migrated from JIRA)

Ahh, forgot about iterations. It does indeed look useful then.

Perhaps mention facet.source in AddFacetedDocTask?

I'm +1 for committing the current progress, but keep this issue open for the search side (to complete the framework).

Gilad Barkai (migrated from JIRA)

Doron, great patch!

I ran it and was somewhat surprised at the large overhead of the facet indexing. Digging deeper, I found the number of random facets to be 1-120 per document, with a depth of 1-8. I believe those requirements are overkill. I reduced them to 1-20 per document with a depth of 1-3 and got results I could live with.
Those numbers are scenario dependent, but I think most cases I have encountered are closer to my proposed numbers. What do you think?

Also, I changed the alg to consume the entire content source.

I would suggest renaming max.facet.length (in the alg) & maxFacetLengh (in the code) to max.facet.depth and maxFacetDepth. Depth seems more appropriate.

Other than that - I'm thrilled to have a working benchmark with facets - thanks!

Shai Erera (@shaie) (migrated from JIRA)

Those number are scenario dependent but I think most cases I encountered are closer to my proposed numbers

I think we should maybe have a Wiki page or something with several .alg files that test different scenarios. I don't think that 1-120 is an example we shouldn't test. Rather, we should describe, either in a Wiki or a JIRA issue, the performance results for each scenario. And if the results are suspicious for a particular scenario, dig deeper and understand why.

So given that you know the numbers from above were run with that many facets per document, do the numbers make sense? Or you still think they're high?

I would suggest renaming max.facet.length (in the alg) & maxFacetLengh (in the code) to max.facet.depth and maxFacetDepth

+1.

Doron Cohen (migrated from JIRA)

I reduced those to 1-20 per document with depth of 1-3 and got results I could live with.

I agree - I tried this too now and the comparison is more reasonable.
What counts as reasonable numbers (for #facets/doc and their depth) is perhaps debatable, but I agree that 200 facets per document is too many.

Changing the defaults to 20/3 and preparing to commit.

Doron Cohen (migrated from JIRA)

Committed to 3x in r1180637, thanks Gilad!
Now porting to trunk; it is more involved than anticipated because of contrib/modules differences.
Managed to make the tests pass and the benchmark alg of choice run.
However, I noticed that in 3x that alg - when indexing reuters - added the entire collection, that is 21578 docs, while in trunk it only added about 400 docs. Might be something in my setup, digging...

Doron Cohen (migrated from JIRA)

After manually removing benchmark/{work,temp}, the reuters collection was correctly extracted, and in trunk the alg runs the same as in 3x.
Committed to trunk: r1180674.
Keeping issue open to handle faceted search benchmarking.
