
Complete overhaul of FieldCache API/Implementation [LUCENE-831] #1906

Open

asfimport opened this issue Mar 14, 2007 · 162 comments

asfimport commented Mar 14, 2007

Motivation:

  1. Complete overhaul of the API/implementation of "FieldCache" type things...
    a) eliminate the global static map keyed on IndexReader (thus
    eliminating the synch block between completely independent IndexReaders)
    b) allow more customization of cache management (ie: use
    expiration/replacement strategies, disk backed caches, etc)
    c) allow people to define custom cache data logic (ie: custom
    parsers, complex datatypes, etc... anything tied to a reader)
    d) allow people to inspect what's in a cache (list of CacheKeys) for
    an IndexReader so a new IndexReader can be likewise warmed
    (see the sketch after this list)
    e) lend support for smarter cache management if/when
    IndexReader.reopen is added (merging of cached data from subReaders)
  2. Provide backwards compatibility to support the existing FieldCache API with
    the new implementation, so there is no redundant caching as client code
    migrates to the new API.
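
For illustration, the kind of CacheKey/Cache abstraction described above might look roughly like the following minimal sketch. All class and method names here are hypothetical and are not taken from the attached patches.

```java
// Hypothetical sketch only: names and signatures are assumptions, not the patch's API.
import java.io.IOException;
import org.apache.lucene.index.IndexReader;

/** Describes some per-reader data to compute and cache (e.g. parsed field values). */
abstract class CacheKey<T> {
  /** Computes the value for this key against the given reader. */
  public abstract T create(IndexReader reader) throws IOException;

  /** Keys are used as map keys, so per-field/per-parser keys must define equality. */
  @Override public abstract boolean equals(Object other);
  @Override public abstract int hashCode();
}

/** Per-reader cache; implementations could use LRU eviction, disk backing, etc. */
interface Cache {
  <T> T get(IndexReader reader, CacheKey<T> key) throws IOException;

  /** Exposes cached keys so a newly opened reader can be warmed the same way. */
  Iterable<CacheKey<?>> keys(IndexReader reader);
}
```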

Migrated from LUCENE-831 by Chris M. Hostetter (@hossman), 5 votes, updated May 09 2016
Attachments: ExtendedDocument.java, fieldcache-overhaul.032208.diff, fieldcache-overhaul.diff (versions: 2), LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff, LUCENE-831.03.31.2008.diff, LUCENE-831.patch (versions: 14), LUCENE-831-trieimpl.patch


Chris M. Hostetter (@hossman) (migrated from JIRA)

This is based on an idea I had a few months ago and was recently reminded of because of several mail threads about FieldCache ... so I started fleshing it out on the plane last week.

I'm not entirely happy with it in its current state, but I wanted to post it and see what people think of the overall approach.

If people like the direction this is going, I would definitely appreciate some help with API critique and good unit tests (so far I've been relying solely on the existing unit tests to validate that I'm not breaking anything – but that doesn't really prove that the new APIs work the way they are intended to).

TODO List

  • lots of little :TODO:s in code, mainly about javadocs
  • add merge methods to StringIndexCacheKey
    (a bit complicated, but should be possible/efficient)
  • figure out if there is any better way of dealing with
    SortComparatorCacheKey and the visibility issues of
    SortComparator.getComparable
  • change norm caching to use the new caches (if not the same
    main cache, then at least the same classes in a private cache)
  • write a lot more tests
  • is there a better way to deal with merging than to pass starts[]?
    (pass a new data structure encapsulating starts/subReaders?
    see the sketch after this list)
  • CacheFactory seemed like a good idea initially so that a MultiReader
    on a multi-segment index could cascade down, but what if people
    only want caching at the outermost level (regardless of whether
    the key is mergable)? ... the factory can't assume anything if the reader
    is not an instance of MultiReader
  • audit/change all core code using FieldCache to use the new API
  • performance test to make sure this doesn't hurt things in some way
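
To make the starts[] question above concrete, a merge along those lines might look roughly like the following. This is purely illustrative; the method name and array layout are assumptions, following the common convention where starts[i] is the doc-id base of sub-reader i and the final slot equals maxDoc().

```java
// Illustrative sketch only: merging per-segment int[] cache values into one
// array for a multi-segment reader using the starts[] offsets.
static int[] mergeInts(int[][] subValues, int[] starts) {
  int maxDoc = starts[starts.length - 1];   // starts has one extra slot == maxDoc()
  int[] merged = new int[maxDoc];
  for (int i = 0; i < subValues.length; i++) {
    // sub-reader i's local doc j maps to global doc starts[i] + j
    System.arraycopy(subValues[i], 0, merged, starts[i], subValues[i].length);
  }
  return merged;
}
```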


Otis Gospodnetic (@otisg) (migrated from JIRA)

I haven't looked at the patch yet. However, I do know that a colleague of mine is about to start porting some FieldCache-based Filter stuff to Lucene's trunk. Because that work may conflict with Hoss' changes here, we should see if this gets applied first.


Chris M. Hostetter (@hossman) (migrated from JIRA)

I haven't had any time to do further work on this issue ... partly because I haven't had a lot of time, but mainly because I'm hoping to get some feedback on the overall approach before investing any more serious effort.

Updated the patch to work against the trunk (r544035).


Mark Miller (@markrmiller) (migrated from JIRA)

I think this patch is great. Not only does it make all of the sort caching logic much easier to decipher, being able to throw in a sophisticated cache manager like ehcache (which can let caches overflow to disk) could end up being pretty powerful. I am also very interested in the possibility of pre-warming a Reader.

I will spend some time playing with what is here so far.


asfimport commented Jul 19, 2007

Chris M. Hostetter (@hossman) (migrated from JIRA)

Thanks for the feedback Mark ... I honestly haven't looked at this patch since the last time I updated the issue ... (I'm not sure if I've even thought about it once since then). It's the kind of thing that seemed really cool and important at the time, but then ... you know, other things come up.

By all means, feel free to update it.

As I recall, the biggest thing about this patch that was really just pie in the sky and may not make any sense is the whole concept of merging and letting subreaders of a MultiReader do their own caching, which could then percolate up. I did it on the assumption that it would come in handy when reopening an IndexReader that contains several segments – many of which may not have changed since the last time you opened the index. But I really didn't have any idea how the whole reopening thing would work. I see now there is some reopen code in #1818, but frankly I'm still not sure whether the API makes sense, or is total overkill.

It might be better to gut the merging logic from the patch and add it later if/when there is a more concrete use case for it (the existing mergeData and isMergable methods could always be re-added to the abstract base classes if it turns out they do make sense).


Uwe Schindler (@uschindler) (migrated from JIRA)

I did some extensive tests with Lucene 2.3 today and was surprised that IndexReader.reopen() in combination with FieldCache/sorting does not bring a performance increase.

Until now, even if you reopen an index using IndexReader.reopen(), you get a new IndexReader instance that has no entries in the default FieldCache. Because of that, all values for the cached fields must be reloaded. Since reopen() only opens new/changed segments, a FieldCache implementation directly embedded into IndexReader, as proposed by this issue, would help here. Each segment would have its own FieldCache and reopening would be quite fast.

Now that reopen is possible as of Lucene 2.3, how about this issue? Would this be possible for 2.4?
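
A minimal sketch of the problem being described, using the FieldCache/reopen APIs of that era (the field name and setup are illustrative):

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.store.Directory;

class ReopenCacheExample {
  // With the global cache keyed on the reader instance, a reopened reader
  // starts cold even though most of its segments are unchanged.
  static void demo(Directory directory) throws IOException {
    IndexReader reader = IndexReader.open(directory);
    int[] prices = FieldCache.DEFAULT.getInts(reader, "price"); // loads the whole field

    // ... a few documents are added elsewhere, then:
    IndexReader newReader = reader.reopen(); // only new/changed segments are actually opened
    if (newReader != reader) {
      // Different reader instance => cache miss => the entire field is reloaded,
      // including the values for all of the unchanged segments.
      int[] prices2 = FieldCache.DEFAULT.getInts(newReader, "price");
    }
  }
}
```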


asfimport commented Jan 16, 2008

Michael Busch (migrated from JIRA)

> Now that reopen is possible as of Lucene 2.3, how about this issue? Would this be possible for 2.4?

Yeah, this is currently a limitation of reopen(). I'm planning to work on it after the 2.3 release is out and #1662 is committed!


Mark Miller (@markrmiller) (migrated from JIRA)

I spent a little time getting this patch somewhat updated to the trunk and running various benchmarks with reopen. As expected, the sequence of searching a large index with a sort, adding a few docs, and then reopening a Reader to perform a sorted search can be tens of times faster.

I think the new API is great too - I really like being able to experiment with other caching strategies. I find the code easier to follow as well.

Did you have any ideas for merging a String index? That is something that still needs to be done...

  • Mark


Mark Miller (@markrmiller) (migrated from JIRA)

Patch roughly moves things forward.

I've added stuff for Long, Short, and Byte parsing, changed getAuto to a static CacheKey method, switched core Lucene from using the FieldCache API to the new API, added some javadoc, and roughly have things going with the reopen method.

In short, still a lot to do, but this advances things enough so that you can apply it to trunk and check out the goodness that this can bring to sorting and reopen.

  • Mark


Chris M. Hostetter (@hossman) (migrated from JIRA)

Mark:

I haven't looked at this issue or any of the code in my patch since my last update, nor have I had a chance to look at your updated patch, but a few comments...

  1. You rock.

  2. I have no idea what I had in mind for dealing with StringIndex when I said "a bit complicated, but should be possible/efficient". I do distinctly remember thinking that we should add support for just getting the string indexes (ie: the current int[numDocs]) for when you don't care what the strings are, just the sort order, or just getting a String[numDocs] when you aren't doing sorting and just want an "inverted inverted index" on the field ... but obviously we still need to support the current StringIndex (it's used in MultiSearcher, right?)


Mark Miller (@markrmiller) (migrated from JIRA)

Right, I think it's used in MultiSearcher and ParallelMultiSearcher, and I believe that's because you cannot compare simple ord arrays across Searchers, so you need the original sort term?

I haven't been able to come up with anything that would be very efficient for merging a StringIndex, but I have not thought too much about it yet. Anyone out there have any ideas? Fastest way to merge two String[], each with an int[] indexing into the String[]? Is there a really fast way?

I agree that it would be nice to skip the String[] if a MultiSearcher was not being used.

I'll keep playing with it.
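
For readers unfamiliar with the structure being discussed: the old FieldCache.StringIndex is roughly a sorted term table (lookup) plus per-doc ordinals (order), which is why ords from different readers can't be compared directly. The sketch below is illustrative only and glosses over docs with no term.

```java
import org.apache.lucene.search.FieldCache.StringIndex;

class OrdCompareExample {
  // Ordinals are only meaningful relative to their own lookup table, so to
  // compare docs from two different readers you must go back to the terms:
  static int compareAcrossReaders(StringIndex a, int docA, StringIndex b, int docB) {
    String termA = a.lookup[a.order[docA]];  // lookup: sorted unique terms; order[doc]: ord
    String termB = b.lookup[b.order[docB]];
    return termA.compareTo(termB);           // comparing the raw ords directly would be wrong
  }
  // Merging two StringIndexes means building one merged, sorted lookup and
  // re-mapping every doc's ord into it: a merge of the two String[] plus a
  // full rewrite of both int[]s, which is what makes an efficient merge hard.
}
```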


Yonik Seeley (@yonik) (migrated from JIRA)

> I agree that it would be nice to skip the String[] if a MultiSearcher was not being used.

If you're going to incrementally update a FieldCache of a MultiReader, it's the same issue... can't merge the ordinals without the original (String) values.


Michael McCandless (@mikemccand) (migrated from JIRA)

One question here: should we switch to a method call, instead of a straight array, to retrieve a cached value for a doc?

If we did that, then MultiSearchers would forward the request to the right IndexReader.

The benefit then is that reopen() of a reader would not have to allocate & bulk copy massive arrays when updating the caches. It would keep the cost of reopen closer to the size of the new segments. And this way the old reader & the new one would not double-allocate the RAM required to hold the common parts of the cache.

We could always still provide a "give me the full array" fallback if people really wanted that (and were willing to accept the cost).
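
A minimal sketch of what such a per-doc accessor could look like at the multi-reader level. The interface and class names are hypothetical, and the starts[] layout follows the usual convention of one base offset per sub-reader plus a final slot equal to maxDoc().

```java
// Hypothetical per-doc accessor: instead of merging arrays, a multi-reader
// level instance forwards each lookup to the sub-reader that owns the doc.
interface IntValues {
  int get(int doc);
}

class MultiIntValues implements IntValues {
  private final IntValues[] subValues;  // one per sub-reader, each loaded independently
  private final int[] starts;           // length == subValues.length + 1, last slot == maxDoc()

  MultiIntValues(IntValues[] subValues, int[] starts) {
    this.subValues = subValues;
    this.starts = starts;
  }

  public int get(int doc) {
    // binary search for the sub-reader whose range [starts[i], starts[i+1]) contains doc
    int lo = 0, hi = subValues.length - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (doc < starts[mid]) {
        hi = mid - 1;
      } else if (doc >= starts[mid + 1]) {
        lo = mid + 1;
      } else {
        return subValues[mid].get(doc - starts[mid]);  // translate into the sub-reader's doc space
      }
    }
    throw new IllegalArgumentException("doc " + doc + " out of range");
  }
}
```

With something along these lines, reopen() would only need to replace the entries for segments that actually changed, which is the cost profile described above.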


Michael Busch (migrated from JIRA)

> The benefit then is that reopen() of a reader would not have to allocate & bulk copy massive arrays when updating the caches. It would keep the cost of reopen closer to the size of the new segments.

I agree, Mike. Currently during reopen() the MultiSegmentReader allocates a new norms array with size maxDoc(), which is, as you said, inefficient if only some (maybe even small) segments changed.

The method call might be a little slower than the array lookup, but I doubt that this would be very significant. We can make this change for the norms and run performance tests to measure the slowdown.


Mark Miller (@markrmiller) (migrated from JIRA)

> If you're going to incrementally update a FieldCache of a MultiReader, it's the same issue... can't merge the ordinals without the original (String) values.

That is a great point.

> Should we switch to a method call, instead of a straight array, to retrieve a cached value for a doc?

Sounds like a great idea to me. Solves the StringIndex merge and eliminates all merge costs at the price of a method call per access.


Mark Miller (@markrmiller) (migrated from JIRA)

Hmm... how do we avoid having to pull the cached field values through a sync on every call? The field data has to be cached... and the method that returns a single cached field value has to be thread-safe...


Michael McCandless (@mikemccand) (migrated from JIRA)

I think if we can finally move to having read-only IndexReaders then they would not sync on this method?

Also, we should still provide the "give me the full array as of right now" fallback which in a read/write usage would allow you to spend lots of RAM in order to not synchronize. Of course you'd also have to update your array (or, periodically ask for a new one) if you are altering fields.


Mark Miller (@markrmiller) (migrated from JIRA)

Here is a quick proof-of-concept type patch for using a method call rather than arrays. The speedup pertains to reopen.

In my quick test of 'open a 500,000 tiny-doc index, repeat(3): add a couple docs / sorted search', the total time taken was:

Orig FieldCache impl: 27 seconds
New impl with arrays: 12 seconds
New impl with method call: 3 seconds

It's kind of a worst-case scenario, but much faster is much faster <g>. The bench does not push through the point where method 3 would have to reload all of the segments, so that would affect it some... but method one is reloading all of the segments every single time...

This approach keeps the original approach for those that want to use the arrays. In that case everything still merges except for the StringIndex, so String sorting is slow. Lucene core is rigged to use the new method call though, so String sorting is sped up as much as the other field types when not using the arrays.

Not sure everything is completely on the level yet, but all core tests pass (core sort tests can miss a lot).

I lied about changing all of core to use the new API... I haven't changed the function package yet.

  • Mark


Mark Miller (@markrmiller) (migrated from JIRA)

Another push forward:

  • Comparators are cached again (left out before while I thought it over).
  • Lots of new JavaDoc.
  • Naming and usage refinements.
  • Still doesn't touch the norms.
  • Cleanup, fixup, finish-up type stuff.
  • Deprecated some function package methods that used FieldCache, but they still need alternatives.


Mark Miller (@markrmiller) (migrated from JIRA)

  • Further refinements and code changes.
  • Fixed ord comparisons across IndexReaders, as Yonik pointed out. The standard sort tests didn't catch it, and I missed it even with the reminder.
  • Added a bunch of CacheKey unit tests.


Michael Busch (migrated from JIRA)

I'm not sure how soon I'll have time to work on this; I don't want to block progress here, so I'm unassigning it for now in case someone else has time.


Mark Miller (@markrmiller) (migrated from JIRA)

Brings this patch back up to trunk level.


Mark Miller (@markrmiller) (migrated from JIRA)

Deprecating the function package is problematic. Doing it by method leaves the classes open to issues maintaining two possible states that could be mixed – the FieldCache and parsers are used as method params, so you can't really replace the old internal representation with the new one. You really have to support the old method and the new method and tell people not to use both, or something. Doing it by class means duplicating half a dozen classes with new names. Neither is very satisfying. This stuff is all marked experimental and subject to change though – could it just be done clean-sweep style? A little abrupt, but... experimental is experimental <g>


Fuad Efendi (migrated from JIRA)

It would be nice to have a TermVectorCache (if term vectors are stored in the index).


Mark Miller (@markrmiller) (migrated from JIRA)

That patch may have a goof; I'll peel off another soon. Unfortunately, it looks like this is slower than the original implementation. I have to run more benchmarks, but it might even be in the 10-20% range. My guess is that it's the method calls – you may gain much faster reopen, but you appear to lose quite a bit on standard sort speed...

Could give the choice of going with the arrays, if they prove as fast as the original, but then we're back to needing an efficient StringIndex merge...


Mark Miller (@markrmiller) (migrated from JIRA)

I've got the function package happily deprecated with replacement methods.

I am going to ballpark the sort speed with method calls to be about 10% slower, but there are a lot of variables and I am still benchmarking.

I've added a check for a system property so that by default, sorting will still use primitive arrays, but if you want to pay the small sort cost you can turn on the method calls and get faster reopen without cache merging.
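
Such a toggle might be read along the following lines; the property and class names here are assumptions for illustration, not the ones in the patch.

```java
class CacheAccessConfig {
  // Hypothetical: primitive arrays stay the default; per-doc method calls are opt-in
  // via -Dorg.apache.lucene.cache.useMethodCalls=true (illustrative property name).
  static final boolean USE_METHOD_CALLS =
      Boolean.getBoolean("org.apache.lucene.cache.useMethodCalls");
}
```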

So that wraps up everything I plan to do, unless comments or criticisms bring up anything else.

The one piece that is missing and should be addressed is an efficient merge for the StringIndex.

Patch to follow.

  • mark


Mark Miller (@markrmiller) (migrated from JIRA)

Here is the patch - plenty to look over.

The method access might not be as much slower as I thought – it might be closer to 2-5% than the 10% I first guessed.


Mark Miller (@markrmiller) (migrated from JIRA)

> change norm caching to use new caches (if not the same main cache, then at least the same classes in a private cache)

What benefit do you see to this? Does it offer anything over the simple map used now?


Chris M. Hostetter (@hossman) (migrated from JIRA)

> What benefit do you see to this? Does it offer anything over the simple map used now?

I think what I meant there was that if the norm caching used the same functionality, then it could simplify the code, and norms would be optimized in the reopen case as well ... plus custom Cache impls could report stats on norm usage (just like anything else) so people could see when norms were getting used for fields they didn't expect them to be.

But as I've said before ... most of my early comments were pie-in-the-sky theoretical brainstorming comments – don't read too much into them if they don't really make sense.


Mark Miller (@markrmiller) (migrated from JIRA)

Updated to trunk.

Took out an optimization from the last patch – it was using ordinals rather than strings to sort when the Reader was not a multi-reader. I didn't like adding isMulti to IndexReader though, so this patch and the last don't have it.


Mark Miller (@markrmiller) (migrated from JIRA)

I likely won't be getting to this anytime soon, so someone else is welcome to work on it. If not, I'll get back to it at some point.

I believe the latest patch is a nice base to work from.

It's still not clear to me whether it's best to start merging using the ValueSource somehow, or to do something where the ValueSource has a merge implementation (allowing for a more efficient private merge). It seems the merge code for fields, norms, and dels is fairly specialized now, but could become a bit more generic. Then perhaps you could add any old ValueSource (other than norms, fields, dels) and easily hook into the merge process. Maybe even in-RAM merges of RAM-based ValueSources – FieldCache, etc. Of course, I guess you could also still do things specialized as now, and just provide access to the files through a ValueSource. That really crimps the pluggability though.

The next step (in terms of the current patch) seems to be to start working ValueSource into norms, dels, possibly stored fields. Eventually they should become pluggable, but I'm not sure how best to plug them in. I was thinking you could set a default ValueSource by field for the FieldCache using the Reader open method with a new param. Perhaps it should take a ValueSourceFactory that can provide a variety of ValueSources based on field, norms, dels, stored fields, with variations for read-only? The proposed componentization of IndexReader could be another approach if it materializes, or worked into this issue.

I don't think I'll understand what's needed for updatability until I'm in deeper. It almost seems like something like setInt(int doc, int n) and setByte(int doc, byte b) on the ValueSource might work. They could possibly throw Unsupported. I know there are a lot of little difficulties involved in all of this though, so I'm not very sure of anything at the moment. The backing impl would be free to update in RAM (say, synced dels), or do a copy-on-write, etc. I guess all methods would throw Unsupported by default, but if you override a getXXX you would have the option of overriding a setXXX.

ValueSources also need the ability to be sharable across IndexReaders with the ability to do copy on write if they are shared and updatable.
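
A rough sketch of the getXXX/setXXX idea described above. This is not the existing org.apache.lucene.search.function.ValueSource API, just an illustration (with a deliberately different name) where setters throw by default and an implementation that overrides a getter may also override the matching setter.

```java
// Hypothetical per-reader value source with optional updatability.
abstract class CachedValueSource {
  // Readers: implementations override whichever types they actually back.
  public int getInt(int doc) { throw new UnsupportedOperationException(); }
  public byte getByte(int doc) { throw new UnsupportedOperationException(); }

  // Writers: optional; a backing impl might update in RAM (e.g. synced dels)
  // or do copy-on-write when shared across IndexReaders.
  public void setInt(int doc, int value) { throw new UnsupportedOperationException(); }
  public void setByte(int doc, byte value) { throw new UnsupportedOperationException(); }
}
```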


Steven Rowe (@sarowe) (migrated from JIRA)

Bulk move 4.4 issues to 4.5 and 5.0


Uwe Schindler (@uschindler) (migrated from JIRA)

Move issue to Lucene 4.9.
