Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IndexReader.close should forcefully evict entries from FieldCache [LUCENE-2135] #3211

Closed
asfimport opened this issue Dec 8, 2009 · 25 comments

Comments

@asfimport
Copy link

asfimport commented Dec 8, 2009

Spinoff of java-user thread "heap memory issues when sorting by a string field".

We rely on WeakHashMap to hold our FieldCache, keyed by reader. But this lacks immediacy on releasing the reference, after a reader is closed.

WeakHashMap can't free the key until the reader is no longer referenced by the app. And, apparently, WeakHashMap has a further impl detail that requires invoking one of its methods for it to notice that a key has just become only weakly reachable.

To fix this, I think on IR.close we should evict entries from the FieldCache, as long as the sub-readers are truly closed (refCount dropped to 0).


Migrated from LUCENE-2135 by Michael McCandless (@mikemccand), resolved Jun 02 2010
Attachments: LUCENE-2135.patch (versions: 3)
Linked issues:

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

This is not unlike what we had to do in creating CloseableThreadLocal... that was another case where the underlying impl failed to free things as immediately as we'd like.

@asfimport
Copy link
Author

Earwin Burrfoot (migrated from JIRA)

A better approach is to don IR-keyed weakHashMaps completely and bind everything you need onto IR itself. That's how I do it and it works like a charm.

@asfimport
Copy link
Author

asfimport commented Dec 8, 2009

Uwe Schindler (@uschindler) (migrated from JIRA)

#1906...

@asfimport
Copy link
Author

Yonik Seeley (@yonik) (migrated from JIRA)

A better approach is to don IR-keyed weakHashMaps completely and bind everything you need onto IR itself. That's how I do it and it works like a charm.

That would be nice... I'd love to see a
Map<Object,Object> IndexReader.getInfo()
That was usable by anyone (not just the field cache) to associate stuff with a reader.

@asfimport
Copy link
Author

Earwin Burrfoot (migrated from JIRA)

I'd love to see a Map<Object,Object> IndexReader.getInfo()

I'm currently using - <T> T IndexReader.component(Class<T> key)
Plus a bundle of factories passed to IR on construction. Factories are called after IR is initialized, and also for child IRs and reopens. In case of reopens, besides new IR they are handed the component they produced for the current one (probably better just to pass old IR).

I can try to conjure a patch this weekend.

@asfimport
Copy link
Author

Uwe Schindler (@uschindler) (migrated from JIRA)

I'm currently using - <T> T IndexReader.component(Class<T> key)

Thats much better than untyped. Same like AttributeSource.

+1 for a patch.

@asfimport
Copy link
Author

asfimport commented Dec 8, 2009

Christian Kohlschütter (migrated from JIRA)

Please see #3209 for a refactoring of FieldCache, which also addresses these problems.

@asfimport
Copy link
Author

asfimport commented Dec 8, 2009

Earwin Burrfoot (migrated from JIRA)

Please see #3209 for a refactoring of FieldCache, which also addresses these problems.

It doesn't address a problem of adding custom components to IR. It also does complicate IR beyond that unholy mess it already is.

I think it's better to have an ability to add 'any' kind of component to IR, and then implement whateverCaches over it.

@asfimport
Copy link
Author

asfimport commented Dec 8, 2009

Christian Kohlschütter (migrated from JIRA)

I haven't followed the aforementioned discussion on the mailing list, but I think this issue covers a few things that are not mentioned explicitly here. Maybe it is a good idea to summarize the actual problems/challenges/benefits in a few sentences?

What I understand is that you plan to add arbitrary (cacheable) attributes to IndexReader. I suggest to move these features to the IndexCache proposed in #3209. Especially when using decorating IndexReaders (things like "ReadOnlyIndexReader") you would not want to store attributes separately from the decorated IndexReader. The same probably applies to SegmentReader with all its clones.

IndexCache would provide a common base for the extensions you mentioned. (i.e. you are welcome to apply your patches on top of LUCENE-2133).

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

I would love to see a bigger solution here, but in the interim, I
think we should fix the current FieldCache (patch attached).

The patch adds FieldCache.purge to the interface. This is technically
a break in back-compat, to any external impls of FieldCache, but
that's such an insanely expert & difficult thing that I think it's
fine to make an exception.

A few tests (incl back-compat) needed fixing, because they were
closing the reader in-between to calls to getInts and then incorrectly
asserting the int[]'s were the same.

@asfimport
Copy link
Author

Uwe Schindler (@uschindler) (migrated from JIRA)

In 2.9. we wrote in the BW section, that FieldCache interface is no BW problem as nobody ever can implement it (because the FileCacheImpl singleton is the only used one). Ok, you can implement it without any use. :-)

@asfimport
Copy link
Author

Uwe Schindler (@uschindler) (migrated from JIRA)

In my opinion, all IndexReaders should call purge, mabe put it on toplevel IR.close default impl? Because if you request FieldCache from Top-level (which you should not do, but you can), it should also be purged.

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

all IndexReaders should call purge, mabe put it on toplevel IR.close default impl?

Hmm... this actually gets tricky to get right, because of the FieldCacheKey.

EG on closing a SegmentReader that's a clone of another, you don't want to evict it from the FieldCache.

I guess I could fix each of the IndexReader subclasses to evict themselves from the cache. Let me look into that...

@asfimport
Copy link
Author

Yonik Seeley (@yonik) (migrated from JIRA)

Hmm... this actually gets tricky to get right, because of the FieldCacheKey.

It's almost like we want two caches... one with entries that are independent of any changes in deleted docs (like the current FieldCache), and one that isn't.

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch, also evicts the other subclasses of IR from FieldCache.

@asfimport
Copy link
Author

Uwe Schindler (@uschindler) (migrated from JIRA)

+1

I just noticed, it is even possible to retrieve a field cache from the FilterIndexReader and that would be a duplicate of the dlegate's cache. Very ugly.

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

it is even possible to retrieve a field cache from the FilterIndexReader and that would be a duplicate of the dlegate's cache.

Yeah, not good. Should we default getFieldCacheKey to delegate? A subclass of FIR would presumably need to then override if their filtering altered what's in the field cache.

@asfimport
Copy link
Author

Uwe Schindler (@uschindler) (migrated from JIRA)

+1 Good idea, just add a note to the method javadocs, that you have to override this, if you change the contents by the filter.

@asfimport
Copy link
Author

asfimport commented Dec 8, 2009

Christian Kohlschütter (migrated from JIRA)

Honestly, please have a look at #3209, I really think it is a good starting point to solve all these problems. Could we perhaps merge the two issues (LUCENE-2133 and LUCENE-2135)?

A quick summary of #3209:

The patch allows one or more IndexReaders to share common cache information (whatever this is), stored in the same "IndexCache" instance. The IndexCache is designed to contain any cacheable/volatile information that can be regenerated from the IndexReader.

For example: all clones of SegmentReader share the same SegmentReaderIndexCache with the original instance, containing the ThreadLocals of the "core reader".
By default (for all IndexReader classes) the IndexCache provides access to the "IndexFieldCache" (a non-static reimplementation of FieldCache).

To provide arbitrary cacheable objects we could now extend IndexCache by a simple HashMap (it does not need to be a WeakHashMap, since the IndexCache is closed and purged as soon as the original IndexReader is closed).

If you wish so, with the help of IndexCache we might even easily implement two different field caches for the same IndexCache instance, one that changes with deleted docs and another one that does not. Basically we may add any other kind of cache at a later point without touching IndexReader again. To re-use Earwin Burrfoot statement from above, that would then not "complicate IndexReader beyond that unholy mess it already is."

@asfimport
Copy link
Author

asfimport commented Dec 8, 2009

Michael McCandless (@mikemccand) (migrated from JIRA)

I definitely plan to have a look at #3209, but that's a rather large (and, good, on first read!) change to Lucene. I just don't think it should hold this small change up.

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch, adds override to FIR.getFieldCacheKey

@asfimport
Copy link
Author

Earwin Burrfoot (migrated from JIRA)

To provide arbitrary cacheable objects we could now extend IndexCache by a simple HashMap (it does not need to be a WeakHashMap, since the IndexCache is closed and purged as soon as the original IndexReader is closed).

If I'm reading your patch right, to add something as a user to DirectoryReader+SegmentReader, I have to extend SegmentReaderIndexCache, IndexCache, then extend SegmentReader and DirectoryReader, and override all methods in DirectoryReader that create SegmentReader.

My aim is to be able to bind stuff to readers without overriding them, delegating, or touching in any manner except providing certain factories on creation.

@asfimport
Copy link
Author

Christian Kohlschütter (migrated from JIRA)

If I'm reading your patch right, to add something as a user to DirectoryReader+SegmentReader, I have to extend SegmentReaderIndexCache, IndexCache, then extend SegmentReader and DirectoryReader, and override all methods in DirectoryReader that create SegmentReader.

No, the functionality you intend to have is just not there at the moment. But it could be added directly to IndexCache (and thus, to all subclasses of IndexCache automatically).

My aim is to be able to bind stuff to readers without overriding them, delegating, or touching in any manner except providing certain factories on creation.

We could add getProperty/setProperty methods to IndexCache. You could then bind/get arbitrary objects as follows:

IndexReader ir = // somehow create your new reader
IndexCache cache = ir.getIndexCache();
Object someObject = cache.getProperty(someKey);
cache.setProperty("org.example.someCoolProperty", anotherObject);

(Personally, I prefer standardized string keys to avoid collisions, just like in the Servlet API, for example)

Again, this is not yet implemented, but could be done easily, without affecting any existing IndexReader or the other changes on FieldCache etc.

@asfimport
Copy link
Author

asfimport commented Dec 10, 2009

Christian Kohlschütter (migrated from JIRA)

There is some discussion in #3209 where we need a decision that also affects this issue. Could you please check and comment?

Thanks!
Christian

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

backport

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment