-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Complete overhaul of FieldCache API/Implementation [LUCENE-831] #1906
Comments
Chris M. Hostetter (@hossman) (migrated from JIRA) This is based on an idea i had a few months ago and was recently reminded of because of several mail threads about FieldCache .. so i started fleshing it out on the plane last week. I'm not entirely happy with it in it's current state, but I wanted to post it and see what people think of the overall approach. if people like the direction this is going, I would definitely appreciate some help with API critique and good unit tests (so far i've been relying solely on the existing Unit Tests to validate that i'm not breaking anything – but that doesn't really prove that the new APIs work the way they are intended to) TODO List
|
Otis Gospodnetic (@otisg) (migrated from JIRA) I haven't looked at the patch yet. However, I do know that a colleague of mine is about to start porting some FieldCache-based Filter stuff to Lucene's trunk. Because that work may conflict with Hoss' changes here, we should see if this get applied first. |
Chris M. Hostetter (@hossman) (migrated from JIRA) i haven't had any time to do further work on this issue ... partly because i haven't had a lot of time, but mainly because i'm hoping to get some feedback on the overall approach before any more serious effort investment. updated patch to work against the trunk (r544035) |
Mark Miller (@markrmiller) (migrated from JIRA) I think this patch is great. Not only does it make all of the sort caching logic much easier to decipher, being able to throw in a sophisticated cache manager like ehcache (which can let caches overflow to disk) could end up being pretty powerful. I am also very interested in the possibility of pre-warming a Reader. I will spend some time playing with what is here so far. |
Chris M. Hostetter (@hossman) (migrated from JIRA) thanks for the feedback mark ... i honestly haven't looked at this patch since the last time i updated the issue ... (i'm not sure if i've even thought about it once since then). it's the kind of things that seemed really cool important at the time, but then ... you know, other things come up. by all means, feel free to update it. as i recall, the biggest thing about this patch that was really just pie in the sky and may not make any sense is the whole concept of merging and letting subreaders of MultiReader do their own caching which could then percolate up. I did it on the assumption that it would come in handy when reopening an IndexReader that contains several segments – many of which may not have changed since the last time you opened the index. but i really didn't have any idea how the whole reopening things would work. i see now there is some reopen code in #1818, but frankly i'm still not sure wether the API makes sense, or is total overkill. it might be better to gut the merging logic from the patch and add it later if/when there becomes a more real use case for it (the existing mergeData and isMergable methods could always be re-added to the abstract base classes if it turns out they do make sense) |
Uwe Schindler (@uschindler) (migrated from JIRA) I did some extensive tests with Lucene 2.3 today and was wondering, that IndexReader.reopen() in combination with FieldCache/Sorting does not bring a performance increase. Until now, even if you reopen an Index using IndexReader.reopen(), you will get a new IndexReader instance which is not available in the default FieldCache. Because of that, all values from the cached fields must be reloaded. As reopen() only opens new/changed segments, a FieldCache implementation directly embedded into IndexReader as propsed by this issue would help here. Each segment would have its own FieldCache and reopening would be quite fast. As with Lucene 2.3 the reopen is possible, how about this issue? Would this be possible for 2.4? |
Michael Busch (migrated from JIRA)
Yeah, this is a limitation currently of reopen(). I'm planning to work on it after the 2.3 release is |
Mark Miller (@markrmiller) (migrated from JIRA) I spent a little time getting this patch somewhat updated to the trunk and running various benchmarks with reopen. As expected, the sequence of searching a large index with a sort, adding a few docs and then reopening a Reader to perform a sorted search, can be 10 of times faster. I think the new API is great too - I really like being able to experiment with other caching strategies. I find the code easier to follow as well. Did you have any ideas for merging a String index? That is something that still needs to be done...
|
Mark Miller (@markrmiller) (migrated from JIRA) Patch roughly moves things forward. I've added stuff for Long, Short, and Byte parsing, changed getAuto to a static cachkey method, switched core Lucene from using the FieldCache API to the new API, added some javadoc, and roughly have things going with the reopen method. In short, still a lot to do, but this advances things enough so that you can apply it to trunk and check out the goodness that this can bring to sorting and reopen.
|
Chris M. Hostetter (@hossman) (migrated from JIRA) Mark: I haven't looked at this issue or any of the code in my patch since my last updated, nor have i had a chance to look at your updated patch, but a few comments...
|
Mark Miller (@markrmiller) (migrated from JIRA) Right, I think its used in MultiSearcher and Parallel-MultSearcher and I believe its because you cannot compare simple ord arrays across Searchers and so you need the original sort term? I havn't been able to come up with anything that would be very efficient for merging a StringIndex, but I have not thought to much about it yet. Anyone out there have any ideas? Fastest way to merge two String[], each with an int[] indexing into the String[]? Is there a really fast way? I agree that it would be nice to skip the String[] if a MultiSearcher was not being used. I'll keep playing with it |
Yonik Seeley (@yonik) (migrated from JIRA) > I agree that it would be nice to skip the String[] if a MultiSearcher was not being used. If you're going to incrementally update a FieldCache of a MultiReader, it's the same issue... can't merge the ordinals without the original (String) values. |
Michael McCandless (@mikemccand) (migrated from JIRA) One question here: should we switch to a method call, instead of a If we did that, then MultiSearchers would forward the request to the The benefit then is that reopen() of a reader would not have to We could always still provide a "give me the full array" fallback if |
Michael Busch (migrated from JIRA)
I agree, Mike. Currently during reopen() the MultiSegmentReader The method call might be a little slower than the array lookup, but |
Mark Miller (@markrmiller) (migrated from JIRA) >If you're going to incrementally update a FieldCache of a MultiReader, it's the same issue... can't merge the ordinals without the original (String) >values. That is a great point. >should we switch to a method call, instead of a straight array, to retrieve a cached value for a doc? Sounds like a great idea to me. Solves the StringIndex merge and eliminates all merge costs at the price of a method call per access. |
Mark Miller (@markrmiller) (migrated from JIRA) Hmm...how do we avoid having to pull the cached field values through a sync on every call? The field data has to be cached...and the method to return the single cached field value has to be multi-threaded... |
Michael McCandless (@mikemccand) (migrated from JIRA) I think if we can finally move to having read-only IndexReaders then they would not sync on this method? Also, we should still provide the "give me the full array as of right now" fallback which in a read/write usage would allow you to spend lots of RAM in order to not synchronize. Of course you'd also have to update your array (or, periodically ask for a new one) if you are altering fields. |
Mark Miller (@markrmiller) (migrated from JIRA) Here is a quick proof-of-concept type patch for using a method call rather than arrays. Speed pertaining to reopen. In my quick test of 'open 500000 tiny docs index, repeat(3): add couple docs/sort search' the total time taken was: Orig FieldCache impl: 27 seconds Its kind of a worse case scenerio, but much faster is much faster<g> The bench does not push through the point where method 3 would have to reload all of the segments, so that would affect it some...but method one is reloading all of the segments every single time... This approach keeps the original approach for those that want to use the arrays. In that case everything still merges except for the StringIndex, so String sorting is slow. Lucene core is rigged to use the new method call though, so String sort is as sped up as the other field types when not using the arrays. Not sure everything is completely on the level yet, but all core tests pass (core sort tests can miss a lot). I lied about changing all core to use the new api...I havn't changed the function package yet.
|
Mark Miller (@markrmiller) (migrated from JIRA) Another push forward:
|
Mark Miller (@markrmiller) (migrated from JIRA) -Further refinements and code changes. |
Michael Busch (migrated from JIRA) I'm not sure how soon I'll have time to work on this; I don't want to block progress here, so I'm unassigning it for now in case someone else has time. |
Mark Miller (@markrmiller) (migrated from JIRA) Brings this patch back up to trunk level. |
Mark Miller (@markrmiller) (migrated from JIRA) Deprecating the function package is problematic. Doing it by method leaves the classes open to issues maintaining two possible states that could be mixed - the FieldCache and parsers are used as method params - you can't really replace the old internal representation with the new one. You really have to support the old method and new method and tell people not to use both or something. Doing it by class means duplicating half a dozen classes with new names. Neither is very satisfying. This stuff is all marked experimental and subject to change though - could it just be done clean sweep style? A little abrupt but...experimental is experimental <g> |
Fuad Efendi (migrated from JIRA) Would be nice to have TermVectorCache (if term vectors are stored in the index) |
Mark Miller (@markrmiller) (migrated from JIRA) That patch may have a goof , I'll peel off another soon. Unfortunately, it looks like this is slower than the orig implementation. I have to run more benchmarks, but it might even be in the 10-20% mark. My guess is that its the method calls - you may gain much faster reopen, but you appear to lose quite a bit on standard sort speed... Could give the choice of going with the arrays, if they prove as fast as orig, but then back to needing an efficient StringIndex merge... |
Mark Miller (@markrmiller) (migrated from JIRA) I've got the function package happily deprecated with replacement methods. I am going to ballpark the sort speed with method calls to be about 10% slower, but there are a lot of variables and I am still benchmarking. I've added a check for a system property so that by default, sorting will still use primitive arrays, but if you want to pay the small sort cost you can turn on the method calls and get faster reopen without cache merging. So that wraps up everything I plan to do unless comments, criticisms bring up anything else. The one piece that is missing and should be addressed is an efficient merge for the StringIndex. Patch to follow.
|
Mark Miller (@markrmiller) (migrated from JIRA) Here is the patch - plenty to look over. The method access might not be as much slower as I thought - might be closer to a couple to 5% than the 10% I first guessed. |
Mark Miller (@markrmiller) (migrated from JIRA)
main cache, then at least the same classes in a private cache) What benefit do you see to this? Does it offer anything over the simple map used now? |
Chris M. Hostetter (@hossman) (migrated from JIRA)
I think what i ment there was that if the norm caching used the same functionality then it could simplify the code, and norms would be optimized in the reopen case as well ... plus custom Cache Impls could report stats on norm usage (just like anything else) so people could see when norms were getting used for fields they didn't expect them to be. But as i've said before ... most of my early comments were pie in the sky theoretical brainstorming comments – don't read too much into them if they don't really make sense. |
Mark Miller (@markrmiller) (migrated from JIRA) Updated to trunk. Took out an optimization last patch - was using ordinals rather than strings to sort when Reader was not Multi - didn't like the isMulti on IndexReader though, so this patch and the last don't have it. |
Mark Miller (@markrmiller) (migrated from JIRA) I won't likely be getting to this anytime soon if someone else wants to work on it. I'll get back at it at some point if not though. I believe the latest patch is a nice base to work from. I'm still not clear to me if its best to start merging using the ValueSource somehow, or do something where the ValueSource has a merge implementation (allowing for a more efficient private merge). It seems the merge code for fields, norms, dels, is fairly specialized now, but could become a bit more generic. Then perhaps you could add any old ValueSource (other than norms, fields, dels) and easily hook into the merge process. Maybe even in RAM merges of RAM based ValueSources - FieldCache etc. Of course, I guess you could also still do things specialized as now, and just provide access to the files through a ValueSource. That really crimps the pluggability though. The next step (in terms of the current patch) seems to be to start working ValueSource into norms, dels, possibly stored fields. Eventually they should become pluggable, but I'm not sure how best to plug them in. I was thinking you could set a default ValueSource by field for the FieldCache using the Reader open method with a new param. Perhaps it should take a ValueSourceFactory that can provide a variety of ValueSources based on field, norms, dels, stored fields, with variations for read-only? The proposed componentization of IndexReader could be another approach if it materializes, or worked into this issue. I don't think I'll understand whats needed for updatability until I'm in deeper. It almost seems like something like setInt(int doc, int n), setByte(int doc, byte b) on the ValueSource might work. They could possibly throw Unsupported. I know there are a lot of little difficulties involved in all of this though, so I'm not very sure of anything at the moment. The backing impl would be free to update in RAM (say synced dels), or do a copy on write, etc. I guess all methods would throw Unsupported by default, but if you override a getXXX you would have the option of overriding a setXXX. ValueSources also need the ability to be sharable across IndexReaders with the ability to do copy on write if they are shared and updatable. |
Steven Rowe (@sarowe) (migrated from JIRA) Bulk move 4.4 issues to 4.5 and 5.0 |
Uwe Schindler (@uschindler) (migrated from JIRA) Move issue to Lucene 4.9. |
Motivation:
a) eliminate global static map keyed on IndexReader (thus
eliminating synch block between completley independent IndexReaders)
b) allow more customization of cache management (ie: use
expiration/replacement strategies, disk backed caches, etc)
c) allow people to define custom cache data logic (ie: custom
parsers, complex datatypes, etc... anything tied to a reader)
d) allow people to inspect what's in a cache (list of CacheKeys) for
an IndexReader so a new IndexReader can be likewise warmed.
e) Lend support for smarter cache management if/when
IndexReader.reopen is added (merging of cached data from subReaders).
the new implementation, so there is no redundent caching as client code
migrades to new API.
Migrated from LUCENE-831 by Chris M. Hostetter (@hossman), 5 votes, updated May 09 2016
Attachments: ExtendedDocument.java, fieldcache-overhaul.032208.diff, fieldcache-overhaul.diff (versions: 2), LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff, LUCENE-831.03.31.2008.diff, LUCENE-831.patch (versions: 14), LUCENE-831-trieimpl.patch
Linked issues:
The text was updated successfully, but these errors were encountered: