
IndexReader.reopen() [LUCENE-743] #1818

Closed
asfimport opened this issue Dec 11, 2006 · 95 comments

asfimport commented Dec 11, 2006

This is Robert Engels' implementation of IndexReader.reopen() functionality, as a set of 3 new classes (this was easier for him to implement, but should probably be folded into the core, if this looks good).


Migrated from LUCENE-743 by Otis Gospodnetic (@otisg), 3 votes, resolved Nov 17 2007
Attachments: IndexReaderUtils.java, lucene-743.patch (versions: 3), lucene-743-take10.patch, lucene-743-take2.patch, lucene-743-take3.patch, lucene-743-take4.patch, lucene-743-take5.patch, lucene-743-take6.patch, lucene-743-take7.patch, lucene-743-take8.patch, lucene-743-take9.patch, MyMultiReader.java, MySegmentReader.java, varient-no-isCloneSupported.BROKEN.patch
Linked issues:


Otis Gospodnetic (@otisg) (migrated from JIRA)

In a direct email to me, Robert said: "All of the files can be prepended with the ASL."


robert engels (migrated from JIRA)

A generic version probably needs to implement reference counting on the Segments or IndexReader in order to know when they can be safely closed.


Chris M. Hostetter (@hossman) (migrated from JIRA)

I somehow missed seeing this issue before ... I don't really understand all the details, but a few comments come to mind...

  1. This approach seems to assume that when reopening a MyMultiReader, the sub readers will all be MySegmentReaders ... assuming we generalize this to MultiReader/SegmentReader, this wouldn't work in the case where people are using a MultiReader containing other MultiReaders ... not to mention the possibility of people who have written their own IndexReader implementations.
    In general we should probably try to approach reopening a reader as a recursive operation if possible, where each type of reader is responsible for checking whether its underlying data has changed: if not, return itself; if so, return a new reader in its place (much like rewrite() works for Queries).

  2. There is no more commit lock, correct? ... Is this approach something that can still be valid using the current backoff/retry mechanism involved in opening segments?
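The recursive contract suggested in point 1 (each reader checks its own underlying data and returns either itself or a replacement, much like Query.rewrite()) could be sketched, outside of Lucene's actual classes, roughly like this. VersionSource, LeafReader, and CompositeReader are invented stand-ins for illustration, not Lucene code:

```java
// Hypothetical sketch of the recursive reopen contract (not Lucene code):
// each reader checks its own source of truth and returns itself if nothing
// changed, or a fresh replacement if it did -- much like Query.rewrite().
class VersionSource {
    int current; // stands in for the on-disk state of one index/segment
}

abstract class Reader {
    abstract Reader reopen(); // contract: returns this iff nothing changed
}

class LeafReader extends Reader {
    final VersionSource source;
    final int versionSeen;
    LeafReader(VersionSource source) {
        this.source = source;
        this.versionSeen = source.current;
    }
    Reader reopen() {
        return versionSeen == source.current ? this : new LeafReader(source);
    }
}

class CompositeReader extends Reader {
    final Reader[] subs;
    CompositeReader(Reader[] subs) { this.subs = subs; }
    Reader reopen() {
        Reader[] fresh = new Reader[subs.length];
        boolean changed = false;
        for (int i = 0; i < subs.length; i++) {
            fresh[i] = subs[i].reopen();      // recurse into sub-readers
            changed |= fresh[i] != subs[i];
        }
        // only build a new composite if some sub-reader actually changed
        return changed ? new CompositeReader(fresh) : this;
    }
}
```

This handles arbitrarily nested composites, though as noted later in the thread, a real MultiReader created by IndexReader.open() would additionally have to open new SegmentReaders for new segments, which plain recursion alone does not cover.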


Michael Busch (migrated from JIRA)

> i somehow missed seeing this issues before ...

Actually, me too... I first came across this thread:

http://www.gossamer-threads.com/lists/lucene/java-dev/31592?search_string=refresh;#31592

in which Doug suggested adding a static method IndexReader.open(IndexReader)
which would either return the passed-in IndexReader instance, in case it is
up to date, or return a new, refreshed instance.

I started implementing this, using Doug's and Robert's ideas, and then realized
that there was already this open issue. But the code here is quite outdated.

> in generally we should probably try to approach reopening a reader as a
> recursive operation

Yeah, we could do that. However, this might not be so easy to implement.
For example, if a user creates a MultiReader instance and adds arbitrary
subreaders, we would have to recursively refresh the underlying readers.
But if the MultiReader was created automatically by IndexReader.open(), just
calling refresh on the subreaders is not enough: new SegmentReaders have to
be opened for new segments.

Also the recursive walk would have to take place within the FindSegmentsFile
logic.

I decided therefore to only allow IndexReaders to be refreshed if they were
created by one of the IndexReader.open() methods. I'm going to submit a first
version of my patch soon. Do you think this is too limiting?


Michael Busch (migrated from JIRA)

First version of my patch:

  • Adds the static method IndexReader.open(IndexReader)
    that returns a new instance of IndexReader in case
    the reader could be updated. If it was up to date
    then the passed-in instance is returned. Only
    IndexReader instances that were created by one of
    the IndexReader.open() methods can be refreshed.

  • SegmentReader.reopen(SegmentInfos) looks in the
    SegmentInfos for a segment with the same name. If
    one could be found then either the deleted docs or
    the norms were updated, otherwise the segment name
    would have changed. reopen() clones the
    SegmentReader and either loads the deleted docs,
    the norms or both. Then the clone is returned and
    the original SegmentReader is marked as closed.

  • If the index has only one segment, then
    IndexReader.open(IndexReader) checks if the passed
    in IndexReader can be refreshed. This is only
    possible if it is not a MultiReader and if the segment
    name has not changed. Otherwise a new SegmentReader
    instance is returned and the old reader is closed.

  • If the index has multiple segments, then
    IndexReader.open(IndexReader) creates a new
    MultiReader and tries to reuse the passed-in
    reader (in case it's a SegmentReader) or its
    subreaders (in case it's a MultiReader). For new
    segments new SegmentReaders are created. Old
    readers that couldn't be reused are closed.

  • Adds the new testcase TestIndexReaderReopen. It
    includes the method
    assertIndexEquals(IndexReader, IndexReader) that
    checks whether both IndexReaders have the same
    content. The testcase creates an index and
    performs different modifications on that index.
    One IndexReader is refreshed after each index
    modification and compared to a fresh IndexReader
    which is opened after each modification.

This first version is for review and not ready to
commit. I want to add more extensive tests and
probably clean up the code.

All tests pass.


Chris M. Hostetter (@hossman) (migrated from JIRA)

> Yeah we could do that. However, this might not be so easy to implement.
> For example, if a user creates a MultiReader instance and adds whatever
> subreaders, we would have to recursively refresh the underlying readers.
> But if the MultiReader was created automatically by IndexReader.open() just
> calling refresh on the subreaders is not enough. New SegmentReaders have to
> be opened for new segments.

...this being the curse that is MultiReader – it can serve two very different purposes.

You seem to have already solved the multi-segments-in-a-single-directory case; the MultiReader-over-many-subreaders part actually seems much easier to me (just call your open method on all of the subreaders). The only tricky part is detecting which behavior should be used when. This could be driven by a simple boolean property of MultiReader indicating whether it owns its directory and we need to look for new segments, or not – in which case we just need to refresh the subreaders. (My personal preference would be to change MultiReader so that "this.directory" is null if it was opened over several other subreaders; right now it is just assigned to the first one arbitrarily, but there may be other consequences of changing that.)

Incidentally: I don't think it's crucial that this be done as a recursive method. The same approach I describe could be added to a static utility like what you've got. I just think that if it's possible to do it recursively, we should, so that if someone does write their own MultiReader or SegmentReader subclass they can still benefit from any core reopening logic, as long as they do their part to "reopen" their extensions.


Chris M. Hostetter (@hossman) (migrated from JIRA)

An extremely hackish refactoring of the previous patch that demonstrates the method working recursively and dealing with MultiReaders constructed over multiple subReaders.

a few notes:

  1. nowhere near enough tests of the subReader situation
  2. the refactoring is very ugly and brutish ... most of the code is still in IndexReader just because it needs so many things that are private – things that really seem like they should be pushed down into SegmentReader (or made protected)
  3. the test triggered an apparent NPE in MultiReader.isOptimized() when there are subReaders; I hacked around this in the test, see usages of assertIndexEqualsZZZ vs assertIndexEquals
  4. the FilterIndexReader situation is ... interesting. In theory FilterIndexReader should really be abstract (since if you aren't subclassing it anyway, why do you want it?)


asfimport commented Aug 1, 2007

Michael Busch (migrated from JIRA)

Now, after #1856, #2046 and #1907 are committed, I updated the latest
patch here, which was now easier because MultiReader is now separated into two classes.

Notes:

  • As Hoss suggested, I made the reopen() method on IndexReader non-static.
  • MultiReader and ParallelReader now override reopen() to reopen the subreaders
    recursively.
  • FilteredReader also overrides reopen(). It checks if the underlying reader has
    changed, and in that case returns a new instance of FilteredReader.

I think the general contract of reopen() should be to always return a new IndexReader
instance if it was successfully refreshed and return the same instance otherwise,
because IndexReaders are used as keys in caches.
A remaining question here is if the old reader(s) should be closed then or not.
This patch closes the old readers for now, if we want to change that we probably have
to add some reference counting mechanism, as Robert suggested already. Then I would
also have to change the SegmentReader.reopen() implementation to clone resources like
the dictionary, norms and delete bits.
I think closing the old reader is fine. What do others think? Is keeping the old
reader after a reopen() a useful use case?
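The identity contract ("same instance unless refreshed") matters precisely because readers are used as cache keys. A hypothetical illustration, not Lucene API (ReaderKeyedCache and its int[] payload are invented stand-ins): entries keyed on the reader instance survive a no-op reopen() and are rebuilt only when a genuinely new reader comes back.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical cache keyed by reader identity. Because reopen() returns
// the very same instance when the index is unchanged, entries survive a
// no-op reopen; a new reader instance naturally misses and is recomputed.
class ReaderKeyedCache {
    private final Map<Object, int[]> perReader = new IdentityHashMap<Object, int[]>();

    int[] bitsFor(Object reader) {
        int[] bits = perReader.get(reader);
        if (bits == null) {
            bits = new int[8]; // stand-in for an expensive per-reader computation
            perReader.put(reader, bits);
        }
        return bits;
    }
}
```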


Michael Busch (migrated from JIRA)

I ran some quick performance tests with this patch:

  1. The test opens an IndexReader, deletes one document by random docid, closes the Reader.
    So this reader doesn't have to open the dictionary or the norms.
  2. Another reader is opened (or alternatively reopened) and one TermQuery is executed, so
    this reader has to read the norms and the dictionary.

I ran these two steps 5000 times in a loop.

First run: Index size: 4.5M, optimized

  • TermQuery: 103 sec
  • Step 2 with open(): 806 sec, so open() takes 703 sec
  • Step 2 with reopen(): 118 sec, so reopen() takes 15 sec ==> Speedup: 46.9x

Second run: Index size: 3.3M, 24 segments (14x 230,000, 10x 10,000)

  • TermQuery: 235 sec
  • Step 2 with open(): 1162 sec, so open() takes 927 sec
  • Step 2 with reopen(): 321 sec, so reopen() takes 86 sec ==> Speedup: 10.8x


Yonik Seeley (@yonik) (migrated from JIRA)

> I think closing the old reader is fine. What do others think? Is keeping the old
> reader after a reopen() a useful usecase?

In a multi-threaded environment, one wants to open a new reader, but needs to wait until all requests finish before closing the old reader. Seems like reference counting is the only way to handle that case.
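The pattern Yonik describes could look roughly like this. A hypothetical sketch (RefCountedReader is an invented class, not the eventual Lucene API): each in-flight request pins the reader with incRef() and unpins with decRef(); the owner's close is just another decRef(), so resources are released only after the last running query finishes.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical reference-counted reader wrapper. Construction takes the
// owner's reference; queries incRef()/decRef() around their work, and the
// underlying resources are released only when the count reaches zero.
class RefCountedReader {
    private final AtomicInteger refCount = new AtomicInteger(1); // owner's ref
    private volatile boolean released = false;

    void incRef() {
        if (refCount.getAndIncrement() <= 0) {
            throw new IllegalStateException("reader already released");
        }
    }

    void decRef() {
        if (refCount.decrementAndGet() == 0) {
            released = true; // here: close files, free norms, etc.
        }
    }

    boolean isReleased() { return released; }
}
```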


Chris M. Hostetter (@hossman) (migrated from JIRA)

(note: i haven't looked at the latest patch in detail, just commenting on the comments)

One key problem I see with automatically closing things in reopen is MultiReader: it's perfectly legal to do something like this pseudocode...

IndexReader a, b, c = ...
MultiReader ab = new MultiReader({a, b})
MultiReader bc = new MultiReader({b, c})
...b changes on disk...
ab.reopen(); // this shouldn't affect bc;

one possibility would be for the semantics of reopen to close old readers only if it completely owns them; i.e. MultiReader should never close anything in reopen, while MultiSegmentReader should close all of its subreaders since it opened them in the first place ... things get into a grey area with SegmentReader though.

In general I think the safest thing to do is for reopen to never close. Yonik's comment showcases one of the most compelling reasons why it can be important for clients to be able to keep using an old IndexReader instance, and it's easy enough for clients that want the old one closed to do something like...

IndexReader r = ...
...
IndexReader tmp = r.reopen();
if (tmp != r) r.close();
r = tmp;
...

(One question that did jump out at me while grepping the patch for where old readers were being closed: why is the meat of reopen still in IndexReader, with "if (reader instanceof SegmentReader)" style logic in it? Can't the different reopen mechanisms be refactored down into SegmentReader and MultiSegmentReader respectively? Shouldn't the default impl in IndexReader throw an UnsupportedOperationException?)


Michael Busch (migrated from JIRA)

> IndexReader a, b, c = ...
> MultiReader ab = new MultiReader({a, b})
> MultiReader bc = new MultiReader({b, c})
> ...b changes on disk...
> ab.reopen(); // this shouldn't affect bc;
>
> one possibility would be for the semantics of reopen to close old readers only
> if it completely owns them; ie: MultiReader should never close anything in
> reopen, MultiSegmentReader should close all of the subreaders since it opened
> them in the first place

So if 'b' in your example is a MultiSegmentReader, then the reopen() call
triggered from MultiReader.reopen() would close old readers, because it owns them,
thus 'bc' wouldn't work anymore. So it depends on the caller of
MultiSegmentReader.reopen() whether or not to close the subreaders. I think this
is kind of messy. Well instead of reopen() we could add
reopen(boolean closeOldReaders), but then...

> IndexReader r = ...
> ...
> IndexReader tmp = r.reopen();
> if (tmp != r) r.close();
> r = tmp;
> ...

... is actually easy enough as you pointed out, so that the extra complexity is not
really worth it IMO.

> In general i think the safest thing to do is for reopen to never close.

So yes, I agree.

> why is the meat of reopen still in "IndexReader" with a "if (reader instanceof
> SegmentReader)" style logic in it? can't the different reopen mechanisms be
> refactored down into SegmentReader and MultiSegmentReader respectively?

I'm not sure if the code would become cleaner if we did that. Sometimes a
SegmentReader would then have to return a MultiSegmentReader instance and vice
versa. So we'd probably have to duplicate some of the code in these two classes.


Chris M. Hostetter (@hossman) (migrated from JIRA)

> I'm not sure if the code would become cleaner if we did that. Sometimes a SegmentReader would then have to
> return a MultiSegmentReader instance and vice versa. So we'd probably have to duplicate some of the code in
> these two classes.

I don't think there would be anything wrong with SegmentReader.reopen returning a MultiSegmentReader in some cases (or vice versa), but it definitely seems wrong to me for a parent class to be explicitly casting "this" to one of two known subclasses ... making reopen abstract in the base class (or throwing UnsupportedOperationException, for API backwards compatibility) seems like the only safe way to ensure any future IndexReader subclasses work properly.


Michael Busch (migrated from JIRA)

We should first refactor segmentInfos into IndexReader's subclasses.


Testo Nakada (migrated from JIRA)

Please also consider making an option where the reopen can be automated (i.e. when the index is updated) instead of having to call it explicitly. Thread safety should be taken into account as well.


Michael Busch (migrated from JIRA)

I'm attaching a new version of the patch that has a lot of changes compared to the last patch:

  • I factored most of the reopen logic into the subclasses of IndexReader. Now that we have the DirectoryIndexReader layer, this was possible in a more elegant way.

  • IndexReader.reopen() now does not close the old readers by default. This was somewhat tricky, because now the IndexReaders must be cloneable. I changed IndexReader to implement the Cloneable interface and implemented clone() for all Lucene built-in IndexReaders. However, there are probably custom IndexReader implementations out there that do not implement clone(), and reopen() should throw an exception when an attempt is made to reopen such a reader. But I don't want to throw an exception in IndexReader.clone() itself, because then it would no longer be possible for subclasses to recursively call the native Object.clone() via super.clone(). To solve this conflict I added the method

{code:java}
/**
 * Returns true, if this IndexReader instance supports the clone() operation.
 */
protected boolean isCloneSupported();
{code}

to IndexReader, which returns false by default. IndexReader.clone() checks if the actual implementation supports clone() (i.e. the above method returns true). If not, it throws an UnsupportedOperationException; if yes, it returns super.clone().

I was not sure whether to throw an (unchecked) UnsupportedOperationException or a CloneNotSupportedException in this case. I decided to throw UnsupportedOperationException, even though it's not really following the clone() guidelines, because it avoids the necessity of catching CloneNotSupportedException every time clone() is called (in the reopen() methods of all IndexReader implementations).

As an example of how the clone() method is used, let me describe how MultiReader.reopen() works: it tries to reopen each of its subreaders. If at least one subreader could be reopened successfully, then a new MultiReader instance is created and the reopened subreaders are added to it. Each of the old MultiReader's subreaders that was not reopened (because of no index changes) is then cloned and added to the new MultiReader.

  • I also added the new method:

{code:java}
/**
 * In addition to {@link #reopen()} this method offers the ability to close
 * the old IndexReader instance. This speeds up the reopening process for
 * certain IndexReader implementations and reduces memory consumption, because
 * resources of the old instance can be reused for the reopened IndexReader,
 * which avoids the need to copy them.
 * <p>
 * The reopen performance especially benefits if IndexReader instances returned
 * by one of the <code>open()</code> methods are reopened with
 * <code>closeOldReader==true</code>.
 * <p>
 * Certain IndexReader implementations ({@link MultiReader}, {@link ParallelReader})
 * require that the subreaders support the clone() operation (see {@link #isCloneSupported()})
 * in order to perform reopen with <code>closeOldReader==false</code>.
 */
public synchronized IndexReader reopen(boolean closeOldReader);
{code}
As the javadoc says, it has two benefits: 1) it speeds up reopening and reduces resource usage, and 2) it allows reopening readers that use non-cloneable subreaders.

Please let me know what you think about these changes, especially about the clone() implementation.


Michael Busch (migrated from JIRA)

I ran new performance tests with the latest patch similar to the tests I explained in an earlier comment on this issue.

I'm using again a 4.5M index. In each round I delete one document and (re)open the IndexReader thereafter. Here are the numbers for 5000 rounds:

                                 Time   Speedup
open                             703s
reopen(closeOldReader==false)     62s   11x
reopen(closeOldReader==true)      16s   44x

Now in each round I delete one document and also set the norm for one random document. Numbers for 1000 rounds:

                                 Time   Speedup
open                             166s
reopen(closeOldReader==false)     33s   5.0x
reopen(closeOldReader==true)      29s   5.7x

I think these numbers look pretty good. We get a quite decent speedup even if the old readers are not closed.

I would like to commit this in a couple of days to get ready for Lucene 2.3. It would be nice if someone could review the latest patch! Hoss? Others?


Yonik Seeley (@yonik) (migrated from JIRA)

I think this looks pretty good Michael!
Too bad so much needs to be cloned in the case that closeOldReader==false... maybe someday in the future we can have read-only readers.


Michael Busch (migrated from JIRA)

> Too bad so much needs to be cloned in the case that closeOldReader==false... maybe someday in the future we can have read-only readers.

Yeah, the cloning part was kind of tedious. Read-only readers would indeed make our life much easier here. I'm wondering how many people are using the IndexReader to alter the norms anyway?

Well, thanks for reviewing the patch, Yonik!


robert engels (migrated from JIRA)

Nice to see all the good work on this. We are still on a 1.9 derivative.

Hopefully we'll be able to move to stock 2.X release in the future.


Chris M. Hostetter (@hossman) (migrated from JIRA)

I haven't looked at the patch yet (I really, really plan to this weekend, barring catastrophe) but I'm confused as to why you went the cloning route (along with the complexity of the API changes to indicate when it is/isn't supported) ... based on the comments, it seems to boil down to...

> As an example for how the clone() method is used let me describe how MultiReader.reopen()
> works: it tries to reopen every of its subreaders. If at least one subreader could be reopened
> successfully, then a new MultiReader instance is created and the reopened subreaders are
> added to it. Every of the old MultiReader's subreaders, that was not reopened (because of no
> index changes) is now cloned() and added to the new MultiReader.

That seems like circular logic: the clone method is used so that the sub readers can be cloned?

Why use clones at all? Why not just use the original reader? (If the "index" that reader represents hasn't changed, why clone it?)

And if (for reasons that aren't clear to me) it is important for MultiReader to use a clone of its subreaders when their reopen method returns "this", then shouldn't clients do the same thing? ... Why not make reopen always return this.clone() if the index hasn't changed? (Which, now that I think about it, would also help by punting on the isCloneSupported issue – each class would already know if it was cloneable.)

Maybe this will make more sense once I read the patch ... I just wanted to throw it out there in case someone had a chance to reply before I get a chance.


Michael Busch (migrated from JIRA)

> why use clones at all? why not just use the original reader (if the "index" that reader represents hasn't changed, why clone it?

Let's say you have a MultiReader with two subreaders:

IndexReader ir1 = IndexReader.open(index1);
IndexReader ir2 = IndexReader.open(index2);
IndexReader mr = new MultiReader(new IndexReader[] {ir1, ir2});

Now index1 changes and you reopen the MultiReader and keep the old one open:

IndexReader mr2 = mr.reopen();

ir1 would now be reopened; let's assume we wouldn't clone ir2. If you use mr2 now to e.g. delete a doc, and that doc happens to be in index2, then mr would also see the changes, because both MultiReaders share the same subreader ir2 and are thus not independent from each other.

> why not make reopen always return this.clone()

clone() might be an expensive operation. We only need to clone if at least one of the subreaders has changed.


Michael McCandless (@mikemccand) (migrated from JIRA)

> > Too bad so much needs to be cloned in the case that
> > closeOldReader==false... maybe someday in the future we can have
> > read-only readers.
>
> Yeah, the cloning part was kind of tedious. Read-only readers would
> indeed make our life much easier here. I'm wondering how many people
> are using the IndexReader to alter the norms anyway?

I think the closeOldReader=false case is actually quite important.

Because in production, I think you'd have to use that, so that your
old reader stays alive and is used to service incoming queries, up
until the point where the re-opened reader is "fully warmed".

Since fully warming could take a long time (minutes?) you need that
old reader to stay open.

Can we take a copy-on-write approach? EG, this is how OS's handle the
virtual memory pages when forking a process. This would mean whenever
a clone has been made of a SegmentReader, they cross-reference one
another. Then whenever either needs to alter deleted docs, one of them
makes a copy then. Likewise for the norms.

This would mean that "read-only" uses of the cloned reader never
pay the cost of copying the deleted-docs bit vector or the norms.
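The copy-on-write idea can be sketched with a BitSet standing in for the deleted-docs vector (CowDeletedDocs is a hypothetical class, not Lucene's SegmentReader): clones share the vector until one side needs to write, and only the writer pays for a copy.

```java
import java.util.BitSet;

// Hypothetical copy-on-write deleted-docs holder: makeClone() shares the
// underlying BitSet; the first write on either side makes a private copy,
// so purely read-only clones never pay the copying cost.
class CowDeletedDocs {
    private BitSet bits;
    private boolean owned; // true if this instance holds a private copy

    CowDeletedDocs(int maxDoc) {
        bits = new BitSet(maxDoc);
        owned = true;
    }

    private CowDeletedDocs(BitSet shared) {
        bits = shared;
        owned = false;
    }

    CowDeletedDocs makeClone() {
        owned = false;               // we now share the vector with the clone
        return new CowDeletedDocs(bits);
    }

    boolean isDeleted(int doc) {     // reads never trigger a copy
        return bits.get(doc);
    }

    void delete(int doc) {
        if (!owned) {                // copy-on-write: detach before mutating
            bits = (BitSet) bits.clone();
            owned = true;
        }
        bits.set(doc);
    }
}
```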


Michael McCandless (@mikemccand) (migrated from JIRA)

Actually if we went back to the sharing (not cloning) approach, could
we insert a check for any writing operation against the re-opened
reader that throws an exception if the original reader is not yet
closed?

In Michael's example above, on calling mr2.deleteDoc, you would hit an
exception because mr2 would check and see that it's "original" reader
mr is not yet closed. But once you've closed mr, then the call
succeeds.

I think this would let us have our cake and eat it too: re-opening
would be very low cost for unchanged readers (no clones created), and,
you can still do deletes, etc, after you have closed your prior
reader. You control when your prior reader gets closed, to allow for
warming time and for queries to finish on the old reader.

Would this work?


Michael McCandless (@mikemccand) (migrated from JIRA)

A couple other questions/points:

  • Just to verify: if you have a DirectoryIndexReader that is holding
    the write lock (because it has made changes to deletes/norms) then
    calling reopen on this reader should always return 'this', right?
    Because it has the write lock, it must be (better be!) current.

    This might be a good place to add an assert: if you are not
    current, assert that you don't have the write lock, and if you
    have the write lock, assert that you are current.

  • I think you should add "ensureOpen()" at the top of
    MultiReader.reopen(...)?

  • If an Exception is hit during reopen, what is the resulting state
    of your original reader? I think, ideally, it is unaffected by
    the attempt to re-open? EG if you're on the hairy edge of file
    descriptor limits, and the attempt to re-open failed because you
    hit the limit, ideally your original reader is unaffected (even if
    you specified closeOldReader=true).


Chris M. Hostetter (@hossman) (migrated from JIRA)

Okay, read the patch. I'm on board with the need for Cloneable now ... it's all about isolation. If "r.reopen(false) == r" then the client is responsible for recognizing that it's got the same instance, and can make the judgement call about reusing the instance or cloning depending on its needs ... internally, in things like MultiReader, we have to assume we need a clone for isolation.

Questions and comments...

  1. CompoundFileReader, FieldsReader, IndexReader, and BitVector all have clone methods where they silently catch and ignore CloneNotSupportedException from super.clone()... If we don't expect the exception to ever happen, we should just let it propagate up the stack (there is no downside to declaring that clone() throws CloneNotSupportedException). If we think the exception might happen, but it's not a big deal and we can ignore it, then we should put a comment in the catch block to that effect. I'm not clear which of the two cases apply here.

  2. I don't remember if the old patches had this issue as well, but having "return new FilterIndexReader(reopened);" in FilterIndexReader doesn't really help anyone using FilterIndexReader – they're going to want an instance of their own subclass. To meet the contract of FilterIndexReader, we should implement all "abstract" methods and delegate to the inner reader – so in theory we don't need a new instance of FilterIndexReader, we could just return in.reopen(closeOld) ... My gut says it would be better to leave the method out entirely, so that the default impl which throws UnsupportedOperationException is used — that way subclasses that want to use reopen must define their own (instead of getting confused when their filtering logic stops working after they reopen for the first time).

  3. Instead of having an isCloneSupported() method, why don't we just remove the "implements Cloneable" declaration from IndexReader and put it on the concrete subclasses – then use "instanceof Cloneable" to determine if something is cloneable? ... For that matter, the only place isCloneSupported() is really used is in IndexReader.clone(), where an exception is thrown if clone is not supported ... Subclasses which know they are Cloneable don't need this from the base class, and subclasses which don't implement Cloneable but are used in combination with core classes whose reopen method requires it could be skipped by the caller (since they don't implement Cloneable).

  4. The javadocs frequently refer to "reopened successfully" and "refreshed successfully" when what is really meant is "reopen() returned a newer/fresher reader" ... this may confuse people who are used to thinking of "successfully" as "without error".

  5. Random thought: are there any synchronization issues we're missing here? I'm wondering about the case where one thread calls reopen while another thread is updating norms or deleting docs. Is there any chance of inconsistent state?
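Regarding point 3: the "instanceof Cloneable" check works because Object.clone() looks at the runtime class for the Cloneable marker, not at the class containing the clone() call. A minimal hypothetical sketch (BaseReader and the two subclasses are invented names, not the patch's classes):

```java
// Hypothetical sketch of guarding clone() with "instanceof Cloneable":
// the base class does not implement Cloneable itself; subclasses opt in
// via the marker interface, and Object.clone() honors the runtime class.
class BaseReader {
    public Object clone() {
        if (!(this instanceof Cloneable)) {
            throw new UnsupportedOperationException(
                getClass().getName() + " does not support clone()");
        }
        try {
            return super.clone(); // succeeds: the runtime class is Cloneable
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // unreachable after the guard above
        }
    }
}

// Opts in to cloning via the marker interface.
class CloneableReader extends BaseReader implements Cloneable { }

// Does not opt in; clone() is refused.
class PlainReader extends BaseReader { }
```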


Chris M. Hostetter (@hossman) (migrated from JIRA)

A rough variation on Michael's latest patch that makes changes along two of the lines I was suggesting before, regarding FilterIndexReader and using "instanceof Cloneable" instead of "isCloneSupported()". Two important things to note:

  1. I didn't change any of the documentation; I was trying to change as little as possible so the two patches could be compared side-by-side.

  2. This patch is currently broken. TestIndexReaderReopen gets an exception I can't understand ... but I have to stop now, and I wanted to post what I had in case people had comments.

...now that I've done this exercise, I'm not convinced that it's a better way to go, but it does seem like isCloneSupported isn't necessary – that's the whole point of the Cloneable interface.


Yonik Seeley (@yonik) (migrated from JIRA)

> I'm wondering about the case where once thread calls reopen while another thread is updating norms or deleting docs.

Hmmm, there is cause for concern (and I should have had my MT-safe hat on :-)
Reopen is synchronized on the reader, and so are norms access and deleted docs, but from a quick look:

  • reopen() may be synchronized, but clone() called on sub-readers isn't in a synchronized context that the sub-reader cares about. For example, MultiReader.reopen has the lock on the MultiReader, but calls subreader.clone(), which iterates over the norms in a non-thread-safe way.
  • IndexInput objects that are in use should never be cloned... (like what is done in FieldsReader.clone())


Michael Busch (migrated from JIRA)

Thanks all for the reviews and comments!

There seem to be some issues/open questions concerning the cloning.
Before I comment on them I think it would make sense to decide whether
we want to stick with the cloning approach or not. Mike suggests this
approach:

> Actually if we went back to the sharing (not cloning) approach, could
> we insert a check for any writing operation against the re-opened
> reader that throws an exception if the original reader is not yet
> closed?

Interesting, yes that should work in case we have two readers (the
original one and the re-opened one). But what happens if the user calls
reopen twice to get two re-opened instances back? Then there would be
three instances, and without cloning the two re-opened ones would also
share the same resources. Is this a desirable use case or would it be
okay to restrict reopen() so that it can only create one new instance?

@asfimport

Michael McCandless (@mikemccand) (migrated from JIRA)

> > Actually if we went back to the sharing (not cloning) approach,
> > could we insert a check for any writing operation against the
> > re-opened reader that throws an exception if the original reader
> > is not yet closed?
>
> Interesting, yes that should work in case we have two readers (the
> original one and the re-opened one). But what happens if the user
> calls reopen twice to get two re-opened instances back? Then there
> would be three instances, and without cloning the two re-opened ones
> would also share the same resources. Is this a desirable use case or
> would it be okay to restrict reopen() so that it can only create one
> new instance?

Hmmm good point.

Actually, we could allow more then one re-open call if we take the
following approach: every time a cloned Reader "borrows" a reference
to a sub-reader, it increments a counter (call it the "referrer
count"). When the Reader is closed, it decrements the count (by 1)
for each of the sub-readers.

Then, any reader should refuse to do a writing operation if its
"referrer" count is greater than 1, because it's being shared across
more than one referrer.

This way if you have a reader X and you did reopen to get Y and did
reopen again to get Z then the shared sub-readers between X, Y and Z
would not allow any write operations until 2 of the three had been
closed. I think that would work?

BTW this would also allow for very efficient "snapshots" during
searching: keeping multiple readers "alive", each searching a
different point-in-time commit of the index, because they would all
share the underlying segment readers that they have in common. Vs
cloning which would have to make many copies of each segment reader.
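The referrer-count guard described above can be pictured with a small sketch. All names here are hypothetical (nothing like this class exists in the patch); it only illustrates the idea of refusing write operations while a sub-reader is shared.

```java
// Hypothetical sketch of the "referrer count" idea: a shared
// sub-reader counts how many parent readers reference it and
// refuses write operations while more than one referrer is alive.
class SharedSubReader {
    private int referrerCount = 1; // the reader that opened it

    synchronized void incReferrer() { referrerCount++; }

    synchronized void decReferrer() { referrerCount--; }

    synchronized int referrers() { return referrerCount; }

    synchronized void deleteDocument(int docId) {
        if (referrerCount > 1) {
            throw new IllegalStateException(
                "reader is shared by " + referrerCount
                + " referrers; close the others before writing");
        }
        // ... perform the actual deletion here ...
    }
}
```

With this scheme, after X is re-opened into Y and Z, the shared sub-readers carry a referrer count of 3 and block writes until two of the three owners have been closed, matching the scenario above.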

@asfimport

Michael Busch (migrated from JIRA)

> I just did a quick partial review of SegmentReader for thread safety only and ran across some potential issues

OK, let's scratch my "ready to commit" comment ;)

A question about thread-safety here. I agree that we must
fix all possible problems concerning two or more
IndexReaders in read-mode, like the FieldsReader issue.

On the other hand: We're saying that performing write
operations on a re-opened reader results in undefined
behavior. Some of the issues you mentioned, Yonik, should
only apply in case one of the shared readers is used to
perform index modifications, right? Then the question is:
how much sense does it make to make reopen() thread-safe
in the write case then?

So I think the multi-threaded testcase should not
perform index modifications using readers involved in a
reopen()?

@asfimport

Yonik Seeley (@yonik) (migrated from JIRA)

Sorry, I hadn't kept up with this issue wrt what was going to be legal (and we should definitely only test what will be legal in the MT test). So that removes the deletedDocs issue I guess.

@asfimport

Thomas Peuss (migrated from JIRA)

To find concurrency issues with a unit test is hard to do, because your potential problems lie in the time domain and not in the code domain. ;-)

From my experience the following things can have an impact on the results of such a test:

  • Running on SP or SMP machines. SMP machines (the more cores the better) reveal concurrency issues much earlier.
  • The Java implementation you are using. IBM's and Sun's thread implementations behave slightly differently, for example.
  • The OS you are running on. This may seem odd at first, but remember that modern Java implementations rely heavily on the threading implementation of the OS.
  • The processor platform you are running on. NUMA vs. UMA (which is AMD vs. Intel). The timing of threads can differ because of this.

And be prepared that one time your test runs through without a problem and on the next run it breaks...

Just my € 0.02

@asfimport

Michael Busch (migrated from JIRA)

Changes in this patch:

  • Fixed ParallelReader and MultiReader so that they don't incRef the subreaders anymore in case reopen() is a NOOP (i.e. reopen() doesn't return a new instance)
  • In the new constructor in MultiSegmentReader it was possible to hit a NullPointerException while filling the norms cache, because I didn't check for null after retrieving the old reader from the HashMap. I fixed this.
  • SegmentReader now also overwrites decRef() and implements decRefReaderNotNorms().
  • As Mike suggested I removed "boolean sharedNorms" from SegmentReader. Now in MultiSegmentReader I compare the norm instances from the old and the new subReaders and copy the bytes to the new cache in case they are ==.
  • In SegmentReader I changed norms to be a HashMap instead of a Hashtable.
  • Norm.decRef() and Norm.incRef() are synchronized now.
  • SegmentReader#norms(String field, byte[] bytes, int offset) now synchronizes on the norm object that is to be read.
  • SegmentReader#reopen() now opens a new FieldsReader because it is not thread-safe.
  • SegmentReader.Norm has a new boolean variable "useSingleNormStream". SegmentReader#norms(String field, byte[] bytes, int offset) checks if it is true. If yes, then the readers' singleNormStream is used, otherwise norm.in. This is necessary so that a reopened SegmentReader always uses its own singleNormStream and to avoid synchronization on the singleNormStream.
  • I added a bunch of code to TestIndexReaderReopen to test the thread-safety of reopen(). It starts 150 threads: some modify the index (some delete docs, some add docs and some set norms), some reopen readers and check if the re-opened ones deliver the same results as fresh ones.
  • assertReaderClosed now checks if the norms are closed and also checks recursively if all subReaders are closed.
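The synchronized ref-counting on Norm mentioned above can be pictured with a much-simplified sketch. This is not the patch's Norm class; field names and behavior here are illustrative only.

```java
// Hypothetical, simplified sketch of a ref-counted norms holder:
// incRef/decRef are synchronized, and the underlying bytes are
// released only when the last referrer lets go.
class Norm {
    private int refCount = 1;
    private byte[] bytes = new byte[8]; // stand-in for the norms data

    synchronized void incRef() {
        if (refCount <= 0) throw new IllegalStateException("already closed");
        refCount++;
    }

    synchronized void decRef() {
        if (refCount <= 0) throw new IllegalStateException("already closed");
        if (--refCount == 0) {
            bytes = null; // release on the last decRef
        }
    }

    synchronized boolean isClosed() { return refCount == 0; }
}
```

A reader that shares a Norm with a re-opened clone calls incRef when borrowing it and decRef on close, so the norms outlive whichever reader closes first but not the last one.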

Still outstanding:

  • On the IBM JVM all tests pass. On Sun, the thread-safety test sometimes fails. When it fails, then in assertReaderClosed, because the refCounts of either the norms or some subReaders aren't 0 at the end of the test. At this point I'm not sure why and I'm still debugging. I just wanted to submit the patch to give others the chance to review the patch or possibly (hopefully) find the problem before me.

@asfimport

Michael Busch (migrated from JIRA)

Changes:

  • Updated patch to current trunk (I just realized that the
    latest didn't apply cleanly anymore)
  • MultiSegmentReader now decRefs the subReaders correctly
    in case an exception is thrown during reopen()
  • Small changes in TestIndexReaderReopen.java

The thread-safety test still sometimes fails. The weird
thing is that the test verifies that the re-opened
readers always return correct results. The only problem
is that the refCount value is not always 0 at the end
of the test. I'm starting to think that the testcase
itself has a problem? Maybe someone else can take a look;
it's probably something really obvious but I'm already
starting to feel dizzy while pondering about thread-safety.

@asfimport

Michael McCandless (@mikemccand) (migrated from JIRA)

I think the cause of the intermittant failure in the test is a missing
try/finally in doReopen to properly close/decRef everything on
exception.

Because of lockless commits, a commit could be in-process while you
are re-opening, in which case you could hit an IOexception and you
must therefore decRef those norms you had incRef'd (and, close eg the
newly opened FieldsReader).
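The cleanup pattern being described amounts to a "success" flag plus a finally clause. The following is an illustrative sketch only, not Lucene's code: the names are hypothetical and an unchecked exception stands in for the IOException a lockless commit can cause.

```java
// Illustrative sketch: resources incRef'd during a reopen must be
// decRef'd again if an exception strikes partway through (e.g.
// because a commit is in progress while we are re-opening).
class ReopenSketch {
    interface Resource { void incRef(); void decRef(); }

    static class Counter implements Resource {
        int refs = 0;
        public void incRef() { refs++; }
        public void decRef() { refs--; }
    }

    static void reopen(Resource[] toBorrow, boolean failHalfway) {
        int acquired = 0;
        boolean success = false;
        try {
            for (Resource r : toBorrow) {
                r.incRef();
                acquired++;
                if (failHalfway && acquired == toBorrow.length / 2) {
                    throw new RuntimeException("segments file changed under us");
                }
            }
            success = true;
        } finally {
            if (!success) {
                // undo only the incRefs that actually happened
                for (int i = 0; i < acquired; i++) {
                    toBorrow[i].decRef();
                }
            }
        }
    }
}
```

Without the finally clause, the half-acquired references leak exactly as in the intermittent test failure: everything behaves correctly, but the refCounts never get back to 0.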

@asfimport
Copy link
Author

Michael Busch (migrated from JIRA)

> I think the cause of the intermittant failure in the test is a missing
> try/finally in doReopen to properly close/decRef everything on
> exception.

Awesome! Thanks so much for pointing me there, Mike! I was getting a
little suicidal here already ... ;)

I should have read the comment in SegmentReader#initialize more
carefully:

    } finally {

      // With lock-less commits, it's entirely possible (and
      // fine) to hit a FileNotFound exception above.  In
      // this case, we want to explicitly close any subset
      // of things that were opened so that we don't have to
      // wait for a GC to do so.
      if (!success) {
        doClose();
      }
    }

While debugging, it's easy to miss such an exception, because
SegmentInfos.FindSegmentsFile#run() ignores it. But it's good that it
logs such an exception, I just have to remember to print out the
infoStream next time.

So it seems that this was indeed the cause for the failing test case.
I made the change and so far the tests didn't fail anymore (ran it
about 10 times so far). I'll run it another few times on a different
JVM and submit an updated patch in a short while if it doesn't fail
again.

@asfimport

Michael Busch (migrated from JIRA)

OK, all tests pass now, including the thread-safety test.
I ran it several times on different JVMs.

Changes:

  • As Mike suggested I added a try ... finally block to
    SegmentReader#reopenSegment() which cleans up after an
    exception was hit.
  • Added some additional comments.
  • Minor improvements to TestIndexReaderReopen

@asfimport

Michael McCandless (@mikemccand) (migrated from JIRA)

> Awesome! Thanks so much for pointing me there, Mike! I was getting a
> little suicidal here already ...

No problem, I lost some hairs tracking that one down too!!

OK, latest patch looks good! I love the new threaded unit test.

Only two smallish comments:

  • You should also close fieldsReader when referencedSegmentReader !=
    null, right? (in SegmentReader.doClose)

  • In the new try/finally in reopenSegment: if you first setup
    referencedSegmentReader, then can't that finally clause just be
    clone.decRef() instead of duplicating code for decRef'ing norms,
    closeNorms(), etc.?

@asfimport

Yonik Seeley (@yonik) (migrated from JIRA)

So how about a public IndexReader.flush() method so that one could also reopen readers that were used for changes?

Usecase:

reader.deleteDocument()
reader.flush()
writer = new IndexWriter()
writer.addDocument()
writer.close()
reader = reader.reopen()
reader.deleteDocument()

@asfimport

Michael Busch (migrated from JIRA)

> • You should also close fieldsReader when referencedSegmentReader !=
>   null, right? (in SegmentReader.doClose)

Yes, will do!

> • In the new try/finally in reopenSegment: if you first setup
>   referencedSegmentReader, then can't that finally clause just be
>   clone.decRef() instead of duplicating code for decRef'ing norms,
>   closeNorms(), etc.?

Hmm, what if then in clone.close() an exception is thrown from
FieldsReader.close() or singleNormStream.close(). In that case it
would not decRef the referenced reader.

Hmm but actually we could change the order in close() so that
referencedSegmentReader.decRefReaderNotNorms() is done first even
if the following close() operations don't succeed?
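The ordering being proposed can be sketched as follows. This is an illustrative toy, not the patch's SegmentReader; the field and method names are hypothetical.

```java
// Illustrative sketch of the ordering discussed above: release the
// reference on the shared reader FIRST, so that an exception thrown
// while closing this reader's private resources cannot leak it.
class CloseOrderSketch {
    static class Shared {
        int refs = 1;
        void decRef() { refs--; }
    }

    Shared referencedReader;   // may be null
    boolean fieldsReaderFails; // simulate FieldsReader.close() throwing
    boolean closed;

    void doClose() {
        if (referencedReader != null) {
            referencedReader.decRef(); // done first, unconditionally
            referencedReader = null;
        }
        closeFieldsReader(); // may throw; the reference is already released
        closed = true;
    }

    void closeFieldsReader() {
        if (fieldsReaderFails) {
            throw new RuntimeException("close failed");
        }
    }
}
```

If the decRef came last instead, a failure in closeFieldsReader() would leave the shared reader's refCount permanently above zero.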

@asfimport

asfimport commented Nov 14, 2007

Michael Busch (migrated from JIRA)

> So how about a public IndexReader.flush() method

Since our goal is it to make IndexReader read-only in the future
(#2106), do you really think we need to add this?

@asfimport

Michael McCandless (@mikemccand) (migrated from JIRA)

> Hmm but actually we could change the order in close() so that
> referencedSegmentReader.decRefReaderNotNorms() is done first even
> if the following close() operations don't succeed?

+1

@asfimport

Michael McCandless (@mikemccand) (migrated from JIRA)

> So how about a public IndexReader.flush() method

I think also if we do decide to do this we should open a new issue for it?

@asfimport

asfimport commented Nov 14, 2007

Yonik Seeley (@yonik) (migrated from JIRA)

> Since our goal is it to make IndexReader read-only in the future
> (#2106), do you really think we need to add this?

flush() would make reopen() useful in more cases, and #2106 is further off (not Lucene 2.3, right?)
Anyway, flush() would be considered a write operation like setNorm() & deleteDocument() and could be deprecated along with them in the future if that's how we decide to go.

> I think also if we do decide to do this we should open a new issue for it?

Yes, that's fine.

@asfimport

Michael Busch (migrated from JIRA)

> I think also if we do decide to do this we should open a new issue for it?

+1

I'll open a new issue.

@asfimport

Michael Busch (migrated from JIRA)

Changes:

  • Close FieldsReader in SegmentReader#doClose() even if
    referencedReader!=null
  • Call clone.decRef() in the finally clause of
    SegmentReader#reopenSegment()
  • decRef referencedReader before closing other resources
    in SegmentReader#doClose()
  • Removed IndexReader#doCloseUnsharedResources().

@asfimport

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch looks good. Only thing I found was this leftover
System.out.println, in SegmentReader.java:

  System.out.println("refCount " + getRefCount());

@asfimport

Michael Busch (migrated from JIRA)

Thanks for the review, Mike! I'll remove the println.

Ok, I think this patch has been reviewed a bunch of times and
should be ready to commit now. I'll wait another day and commit
it then if nobody objects.

@asfimport

Michael Busch (migrated from JIRA)

Changes:

  • Updated to current trunk.
  • Removed println in SegmentReader.

I'm going to commit this soon!

@asfimport

Michael Busch (migrated from JIRA)

Committed! Phew!!!
