IndexReader.reopen() [LUCENE-743] #1818
Comments
Otis Gospodnetic (@otisg) (migrated from JIRA) In a direct email to me, Robert said: "All of the files can be prepended with the ASL." |
robert engels (migrated from JIRA) A generic version probably needs to implement reference counting on the Segments or IndexReader in order to know when they can be safely closed. |
Chris M. Hostetter (@hossman) (migrated from JIRA) i somehow missed seeing this issue before ... i don't really understand the details, but a few comments that come to mind...
|
Michael Busch (migrated from JIRA) > i somehow missed seeing this issue before ... actually, me too... first I came across this thread: http://www.gossamer-threads.com/lists/lucene/java-dev/31592?search_string=refresh;#31592 in which Doug suggested adding a static method IndexReader.open(IndexReader). I started implementing this, using Doug's and Robert's ideas and then realized > in generally we should probably try to approach reopening a reader as a Yeah we could do that. However, this might not be so easy to implement. Also the recursive walk would have to take place within the FindSegmentsFile I decided therefore to only allow IndexReaders to be refreshed if they were |
Michael Busch (migrated from JIRA) First version of my patch:
This first version is for review and not ready to All tests pass. |
Chris M. Hostetter (@hossman) (migrated from JIRA) > Yeah we could do that. However, this might not be so easy to implement. ...this being the curse that is MultiReader – it can serve two very different purposes. You seem to have already solved the multiple-segments-in-a-single-directory approach; the MultiReader over many subreaders part actually seems much easier to me (just call your open method on all of the subreaders). The only tricky part is detecting which behavior should be used when. This could be driven by a simple boolean property of MultiReader indicating whether it owns its directory and we need to look for new segments or not – in which case we just need to refresh the subreaders. (My personal preference would be to change MultiReader so "this.directory" is null if it was opened over several other subReaders; right now it's just assigned to the first one arbitrarily, but there may be other consequences of changing that.) Incidentally: I don't think it's crucial that this be done as a recursive method. The same approach i describe could be added to a static utility like what you've got; I just think that if it's possible to do it recursively we should, so that if someone does write their own MultiReader or SegmentReader subclass they can still benefit from any core reopening logic as long as they do their part to "reopen" their extensions. |
Chris M. Hostetter (@hossman) (migrated from JIRA) an extremely hackish refactoring of the previous patch that demonstrates the method working recursively and dealing with MultiReaders constructed over multiple subReaders. a few notes:
|
Michael Busch (migrated from JIRA) Now, after #1856, #2046 and #1907 are committed, I updated the latest Notes:
I think the general contract of reopen() should be to always return a new IndexReader |
Michael Busch (migrated from JIRA) I ran some quick performance tests with this patch:
I run these two steps 5000 times in a loop. First run: Index size: 4.5M, optimized
Second run: Index size: 3.3M, 24 segments (14x 230.000, 10x 10.000)
|
Yonik Seeley (@yonik) (migrated from JIRA) > I think closing the old reader is fine. What do others think? Is keeping the old In a multi-threaded environment, one wants to open a new reader, but needs to wait until all requests finish before closing the old reader. Seems like reference counting is the only way to handle that case. |
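The reference-counting idea raised here (and earlier by Robert) can be sketched roughly as follows. This is only an illustration under assumptions: `RefCountedReader` is a hypothetical stand-in, not a class from any patch on this issue, and a real reader would release files where the sketch just sets a flag.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an IndexReader; not Lucene's actual class.
class RefCountedReader {
    // The code that opened the reader holds the first reference.
    private final AtomicInteger refCount = new AtomicInteger(1);
    private volatile boolean closed = false;

    /** Called by a request thread before it starts using the reader. */
    void incRef() {
        int c;
        do {
            c = refCount.get();
            // Never resurrect a reader whose count already dropped to zero.
            if (c <= 0) {
                throw new IllegalStateException("reader already closed");
            }
        } while (!refCount.compareAndSet(c, c + 1));
    }

    /** Called when a user is done; the last release closes the resources. */
    void decRef() {
        int c = refCount.decrementAndGet();
        if (c == 0) {
            closed = true; // a real reader would release its files here
        } else if (c < 0) {
            throw new IllegalStateException("too many decRef calls");
        }
    }

    boolean isClosed() { return closed; }

    int getRefCount() { return refCount.get(); }
}
```

With this shape, the thread that swaps in a new reader simply releases its own reference to the old one; the old reader stays usable until the last in-flight request has also called `decRef()`.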
Chris M. Hostetter (@hossman) (migrated from JIRA) (note: i haven't looked at the latest patch in detail, just commenting on the comments) One key problem i see with automatically closing things in reopen is MultiReader: it's perfectly legal to do something like this pseudocode... IndexReader a, b, c = ... one possibility would be for the semantics of reopen to close old readers only if it completely owns them; ie: MultiReader should never close anything in reopen, MultiSegmentReader should close all of the subreaders since it opened them in the first place ... things get into a grey area with SegmentReader though. In general i think the safest thing to do is for reopen to never close. Yonik's comment showcases one of the most compelling reasons why it can be important for clients to be able to keep using an old IndexReader instance, and it's easy enough for clients that want the old one closed to do something like... IndexReader r = ... (one question that did jump out at me while grepping the patch for where old readers were being closed: why is the meat of reopen still in "IndexReader" with a "if (reader instanceof SegmentReader)" style logic in it? can't the different reopen mechanisms be refactored down into SegmentReader and MultiSegmentReader respectively? shouldn't the default impl of IndexReader throw an UnsupportedOperationException?) |
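The "reopen never closes, the client decides" policy could look roughly like the sketch below. The snippets in the comment above were lost in migration, so this is not a reconstruction of them: `SnapshotReader`, `ReopenClient` and the static version counter are hypothetical names used only to make the example self-contained.

```java
// Hypothetical, simplified reader: "version" stands in for the index state it was opened on.
class SnapshotReader {
    // Pretend index version; an assumption of this sketch, bumped to simulate a commit.
    static long currentIndexVersion = 1;

    final long version;
    boolean closed = false;

    SnapshotReader(long version) { this.version = version; }

    /** Returns this if nothing changed, otherwise a new reader. Never closes the old one. */
    SnapshotReader reopen() {
        return version == currentIndexVersion ? this : new SnapshotReader(currentIndexVersion);
    }

    void close() { closed = true; }
}

class ReopenClient {
    /** Client-side policy: refresh, and close the old reader only if a new one came back. */
    static SnapshotReader refresh(SnapshotReader old) {
        SnapshotReader fresh = old.reopen();
        if (fresh != old) {
            old.close();
        }
        return fresh;
    }
}
```

A client that still has requests running against the old reader would skip the `close()` call and release the old instance later instead.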
Michael Busch (migrated from JIRA) > IndexReader a, b, c = ... So if 'b' in your example is a MultiSegmentReader, then the reopen() call > IndexReader r = ... ... is actually easy enough as you pointed out, so that the extra complexity is not > In general i think the safest thing to do is for reopen to never close. So yes, I agree. > why is the meat of reopen still in "IndexReader" with a "if (reader instanceof I'm not sure if the code would become cleaner if we did that. Sometimes a |
Chris M. Hostetter (@hossman) (migrated from JIRA) > I'm not sure if the code would become cleaner if we did that. Sometimes a SegmentReader would then have to i don't think there would be anything wrong with SegmentReader.reopen returning a MultiSegmentReader in some cases (or vice versa) but it definitely seems wrong to me for a parent class to be explicitly casting "this" to one of two known subclasses ... making reopen abstract in the base class (or having it throw UnsupportedOperationException, for API backward compatibility) seems like the only safe way to ensure any future IndexReader subclasses work properly. |
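A minimal sketch of the structure argued for here: each subclass overrides `reopen()`, the base class throws by default, and a single-segment reader may hand back a multi-segment one (or vice versa). All class names are hypothetical, and the `IntSupplier` is an assumed stand-in for reading the current segment count from the directory.

```java
import java.util.function.IntSupplier;

// Hypothetical base class: no instanceof checks, just a throwing default.
abstract class BaseReader {
    BaseReader reopen() {
        throw new UnsupportedOperationException(
            getClass().getSimpleName() + " does not support reopen()");
    }
}

class OneSegmentReader extends BaseReader {
    private final IntSupplier segmentsOnDisk; // assumed stand-in for reading the segments file

    OneSegmentReader(IntSupplier segmentsOnDisk) { this.segmentsOnDisk = segmentsOnDisk; }

    @Override
    BaseReader reopen() {
        int n = segmentsOnDisk.getAsInt();
        // Still one segment: nothing to do. Otherwise switch reader type.
        return n == 1 ? this : new ManySegmentReader(segmentsOnDisk, n);
    }
}

class ManySegmentReader extends BaseReader {
    private final IntSupplier segmentsOnDisk;
    final int segmentCount;

    ManySegmentReader(IntSupplier segmentsOnDisk, int segmentCount) {
        this.segmentsOnDisk = segmentsOnDisk;
        this.segmentCount = segmentCount;
    }

    @Override
    BaseReader reopen() {
        int n = segmentsOnDisk.getAsInt();
        if (n == segmentCount) {
            return this;
        }
        return n == 1 ? new OneSegmentReader(segmentsOnDisk)
                      : new ManySegmentReader(segmentsOnDisk, n);
    }
}

// A third-party subclass that does nothing special inherits the throwing default.
class CustomReader extends BaseReader { }
```

Comparing only segment counts is a deliberate simplification; a real implementation would compare the segment infos themselves.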
Michael Busch (migrated from JIRA) We should first refactor segmentInfos into IndexReader's subclasses. |
Testo Nakada (migrated from JIRA) Please also consider making an option where the reopen can be automated (i.e. when the index is updated) instead of having to call it explicitly. Thread safety should be taken into account as well. |
Michael Busch (migrated from JIRA) I'm attaching a new version of the patch that has a lot of changes compared to the last patch:
to IndexReader which returns false by default. IndexReader.clone() checks if the actual implementation supports clone() (i. e. the above method returns true). If not, it throws an UnsupportedOperationException, if yes, it returns super.clone().
I was not sure about whether to throw an (unchecked) UnsupportedOperationException or a CloneNotSupportedException in this case. I decided to throw UnsupportedOperationException even though it's not really following the clone() guidelines, because it avoids the necessity to catch the CloneNotSupportedException every time clone() is called (in the reopen() methods of all IndexReader implementations).
As an example of how the clone() method is used, let me describe how MultiReader.reopen() works: it tries to reopen each of its subreaders. If at least one subreader could be reopened successfully, then a new MultiReader instance is created and the reopened subreaders are added to it. Each of the old MultiReader's subreaders that was not reopened (because of no index changes) is now cloned and added to the new MultiReader.
- I also added the new method
{code:java}
/**
 * In addition to {@link #reopen()} this method offers the ability to close
 * the old IndexReader instance. This speeds up the reopening process for
 * certain IndexReader implementations and reduces memory consumption, because
 * resources of the old instance can be reused for the reopened IndexReader
 * as it avoids the need of copying the resources.
 * <p>
 * The reopen performance especially benefits if IndexReader instances returned
 * by one of the <code>open()</code> methods are reopened with
 * <code>closeOldReader==true</code>.
 * <p>
 * Certain IndexReader implementations ({@link MultiReader}, {@link ParallelReader})
 * require that the subreaders support the clone() operation (see {@link #isCloneSupported()})
 * in order to perform reopen with <code>closeOldReader==false</code>.
 */
public synchronized IndexReader reopen(boolean closeOldReader);
{code}
As the javadoc says, it has two benefits: 1) it speeds up reopening and reduces resource usage, and 2) it allows reopening readers that use non-cloneable subreaders. Please let me know what you think about these changes, especially about the clone() implementation. |
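The MultiReader.reopen() behaviour described in this comment (reopen every subreader; if at least one changed, build a new composite from the reopened subreaders plus clones of the unchanged ones) can be sketched as below. `ToySubReader` and `ToyMultiReader` are hypothetical names, and the `changed` flag is an assumption that simulates a pending index change; this is not the code from the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical subreader: "changed" simulates an index change waiting to be picked up.
class ToySubReader implements Cloneable {
    boolean changed;

    ToySubReader(boolean changed) { this.changed = changed; }

    /** Returns a fresh reader if the index changed, otherwise this. */
    ToySubReader reopen() {
        return changed ? new ToySubReader(false) : this;
    }

    @Override
    public ToySubReader clone() {
        try {
            return (ToySubReader) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen: the class is Cloneable
        }
    }
}

class ToyMultiReader {
    final List<ToySubReader> subReaders;

    ToyMultiReader(List<ToySubReader> subReaders) { this.subReaders = subReaders; }

    ToyMultiReader reopen() {
        List<ToySubReader> reopened = new ArrayList<>();
        boolean anyChanged = false;
        for (ToySubReader sub : subReaders) {
            ToySubReader r = sub.reopen();
            if (r != sub) {
                anyChanged = true;
            }
            reopened.add(r);
        }
        if (!anyChanged) {
            return this; // nothing to do, and no cloning cost
        }
        // Unchanged subreaders are cloned so old and new composites stay independent.
        List<ToySubReader> result = new ArrayList<>();
        for (int i = 0; i < subReaders.size(); i++) {
            ToySubReader r = reopened.get(i);
            result.add(r != subReaders.get(i) ? r : r.clone());
        }
        return new ToyMultiReader(result);
    }
}
```

Cloning is deferred until it is known that at least one subreader changed, which matches the point that clone() may be expensive.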
Michael Busch (migrated from JIRA) I ran new performance tests with the latest patch similar to the tests I explained in an earlier comment on this issue. I'm using again a 4.5M index. In each round I delete one document and (re)open the IndexReader thereafter. Here are the numbers for 5000 rounds:
Now in each round I delete one document and also set the norm for one random document. Numbers for 1000 rounds:
I think these numbers look pretty good. We get a quite decent speedup even if the old readers are not closed. I would like to commit this in a couple of days to get ready for Lucene 2.3. It would be nice if someone could review the latest patch! Hoss? Others? |
Yonik Seeley (@yonik) (migrated from JIRA) I think this looks pretty good Michael! |
Michael Busch (migrated from JIRA) > Too bad so much needs to be cloned in the case that closeOldReader==false... maybe someday in the future we can have read-only readers. Yeah, the cloning part was kind of tedious. Read-only readers would indeed make our life much easier here. I'm wondering how many people are using the IndexReader to alter the norms anyway? Well, thanks for reviewing the patch, Yonik! |
robert engels (migrated from JIRA) Nice to see all the good work on this. We are still on a 1.9 derivative. Hopefully we'll be able to move to stock 2.X release in the future. |
Chris M. Hostetter (@hossman) (migrated from JIRA) I haven't looked at the patch yet (i really really plan to this weekend, barring catastrophe) but i'm confused as to why you went the cloning route (along with the complexity of the api changes to indicate when it is/isn't supported) ... based on the comments, it seems to boil down to... > As an example for how the clone() method is used let me describe how MultiReader.reopen() that seems like circular logic: the clone method is used so that the sub readers can be cloned? why use clones at all? why not just use the original reader (if the "index" that reader represents hasn't changed, why clone it?) And if (for reasons that aren't clear to me) it is important for MultiReader to use a clone of its subreaders when their reopen method returns "this" then shouldn't clients do the same thing? ... why not make reopen always return this.clone() if the index hasn't changed (which now that i think about it, would also help by punting on the isCloneSupported issue – each class would already know if it was cloneable). maybe this will make more sense once i read the patch ... i just wanted to throw it out there in case someone had a chance to reply before i get a chance. |
Michael Busch (migrated from JIRA) > why use clones at all? why not just use the original reader (if the "index" that reader represents hasn't changed, why clone it?) Let's say you have a MultiReader with two subreaders:
IndexReader ir1 = IndexReader.open(index1);
IndexReader ir2 = IndexReader.open(index2);
IndexReader mr = new MultiReader(new IndexReader[] {ir1, ir2});
Now index1 changes and you reopen the MultiReader and keep the old one open:
IndexReader mr2 = mr.reopen();
ir1 would now be reopened, and let's assume we wouldn't clone ir2. If you now use mr2 to e.g. delete a doc and that doc happens to be in index2, then mr would also see the changes, because both MultiReaders share the same subreader ir2 and are thus not independent of each other.
> why not make reopen always return this.clone() clone() might be an expensive operation. We only need to clone if at least one of the subreaders has changed. |
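The isolation problem in this example can be reproduced with a toy reader whose only mutable state is its set of deleted documents. This is a sketch under assumptions: `DeletableReader` and `IsolationDemo` are hypothetical names, and a `java.util.BitSet` stands in for whatever structure a real reader uses.

```java
import java.util.BitSet;

// Hypothetical subreader whose only mutable state is its deleted-docs set.
class DeletableReader implements Cloneable {
    private BitSet deleted = new BitSet();

    void deleteDocument(int doc) { deleted.set(doc); }

    boolean isDeleted(int doc) { return deleted.get(doc); }

    @Override
    public DeletableReader clone() {
        try {
            DeletableReader c = (DeletableReader) super.clone();
            c.deleted = (BitSet) deleted.clone(); // deep copy, so deletes do not leak
            return c;
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen: the class is Cloneable
        }
    }
}

class IsolationDemo {
    /**
     * Deletes a doc through a second view of ir2 (as the reopened composite would)
     * and reports whether the original view (held by the old composite) sees it.
     */
    static boolean deleteLeaks(boolean cloneFirst) {
        DeletableReader ir2 = new DeletableReader();
        DeletableReader view = cloneFirst ? ir2.clone() : ir2;
        view.deleteDocument(7);
        return ir2.isDeleted(7);
    }
}
```

`IsolationDemo.deleteLeaks(false)` models the shared subreader and returns true; `deleteLeaks(true)` models the cloned subreader and returns false.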
Michael McCandless (@mikemccand) (migrated from JIRA) > > Too bad so much needs to be cloned in the case that I think the closeOldReader=false case is actually quite important. Because in production, I think you'd have to use that, so that your Since fully warming could take a long time (minutes?) you need that Can we take a copy-on-write approach? EG, this is how OS's handle the This would mean that "read-only" uses of the cloned reader never |
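One way to picture a copy-on-write approach is sketched below. The comment above is truncated, so the details are an assumption rather than Mike's actual proposal: `CowDeletes` is a hypothetical class in which two views share one `BitSet` until either of them writes, at which point the writer takes a private copy. It is not thread-safe as written.

```java
import java.util.BitSet;

// Hypothetical copy-on-write holder for deleted-docs bits (not thread-safe).
class CowDeletes {
    private BitSet bits;
    private boolean shared; // true while another view may point at the same BitSet

    CowDeletes() {
        this.bits = new BitSet();
        this.shared = false;
    }

    private CowDeletes(BitSet bits) {
        this.bits = bits;
        this.shared = true;
    }

    /** Cheap "clone": both views point at the same bits until one of them writes. */
    CowDeletes share() {
        shared = true;
        return new CowDeletes(bits);
    }

    boolean get(int doc) { return bits.get(doc); }

    void set(int doc) {
        if (shared) {
            // First write after sharing: take a private copy before mutating.
            // (If the other view already copied, this copy is redundant but harmless.)
            bits = (BitSet) bits.clone();
            shared = false;
        }
        bits.set(doc);
    }
}
```

Read-only use of either view never pays for a copy; only the first write after `share()` does.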
Michael McCandless (@mikemccand) (migrated from JIRA) Actually if we went back to the sharing (not cloning) approach, could In Michael's example above, on calling mr2.deleteDoc, you would hit an I think this would let us have our cake and eat it too: re-opening Would this work? |
Michael McCandless (@mikemccand) (migrated from JIRA) A couple other questions/points:
|
Chris M. Hostetter (@hossman) (migrated from JIRA) Okay, read the patch. I'm on board with the need for Cloneable now ... it's all about isolation. if "r.reopen(false) == r" then the client is responsible for recognizing that it's got the same instance and can make the judgement call about reusing the instance or cloning depending on its needs ... internally in things like MultiReader we have to assume we need a clone for isolation. Questions and comments...
|
Chris M. Hostetter (@hossman) (migrated from JIRA) a rough variation on Michael's latest patch that makes the changes along two of the lines i was suggesting before regarding FilterIndexReader and using "instanceof Cloneable" instead of "isCloneSupported()" two important things to note:
...now that i've done this exercise, i'm not convinced that it's a better way to go, but it does seem like isCloneSupported isn't necessary; that's the whole point of the Cloneable interface.
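The "instanceof Cloneable" alternative can be sketched as follows. `PlainReader` and `CopyableReader` are hypothetical names, and the method is called `copy()` here only to keep the sketch separate from the real clone() signature discussed above; it throws the unchecked UnsupportedOperationException, as the patch chose to do.

```java
// Hypothetical base reader: support for copying is signalled by the Cloneable marker interface.
class PlainReader {
    /** Returns an independent copy, or throws if this implementation is not Cloneable. */
    PlainReader copy() {
        if (!(this instanceof Cloneable)) {
            throw new UnsupportedOperationException(
                getClass().getSimpleName() + " is not Cloneable");
        }
        try {
            // Object.clone() produces an instance of the runtime class.
            return (PlainReader) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // unreachable after the instanceof check
        }
    }
}

// A subclass opts in simply by implementing the marker interface.
class CopyableReader extends PlainReader implements Cloneable { }
```

No separate `isCloneSupported()` flag is needed in this shape; callers can test `reader instanceof Cloneable` directly.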
Yonik Seeley (@yonik) (migrated from JIRA) > I'm wondering about the case where one thread calls reopen while another thread is updating norms or deleting docs. Hmmm there is cause for concern (and I should have had my mt-safe hat on :-)
|
Michael Busch (migrated from JIRA) Thanks all for the reviews and comments! There seem to be some issues/open questions concerning the cloning. > Actually if we went back to the sharing (not cloning) approach, could Interesting, yes that should work in case we have two readers (the |
Michael McCandless (@mikemccand) (migrated from JIRA) > > Actually if we went back to the sharing (not cloning) approach, Hmmm good point. Actually, we could allow more than one re-open call if we take the Then, any reader should refuse to do a writing operation if its This way if you have a reader X and you did reopen to get Y and did BTW this would also allow for very efficient "snapshots" during |
Michael Busch (migrated from JIRA)
OK, let's scratch my "ready to commit" comment ;) A question about thread-safety here. I agree that we must On the other hand: We're saying that performing write So I think the multi-threaded testcase should not |
Yonik Seeley (@yonik) (migrated from JIRA) Sorry, I hadn't kept up with this issue wrt what was going to be legal (and we should definitely only test what will be legal in the MT test). So that removes the deletedDocs issue I guess. |
Thomas Peuss (migrated from JIRA) Finding concurrency issues with a unit test is hard, because your potential problems lie in the time domain and not in the code domain. ;-) From my experience, the following things can have an impact on the results of such a test:
And be prepared for your test to run through without a problem one time and break on the next run... Just my € 0.02 |
Michael Busch (migrated from JIRA) Changes in this patch:
Still outstanding:
|
Michael Busch (migrated from JIRA) Changes:
The thread-safety test still sometimes fails. The weird
|
Michael McCandless (@mikemccand) (migrated from JIRA) I think the cause of the intermittent failure in the test is a missing Because of lockless commits, a commit could be in-process while you |
Michael Busch (migrated from JIRA) > I think the cause of the intermittent failure in the test is a missing Awesome! Thanks so much for pointing me there, Mike! I was getting a I should have read the comment in SegmentReader#initialize more
} finally {
  // With lock-less commits, it's entirely possible (and
  // fine) to hit a FileNotFound exception above. In
  // this case, we want to explicitly close any subset
  // of things that were opened so that we don't have to
  // wait for a GC to do so.
  if (!success) {
    doClose();
  }
}
While debugging, it's easy to miss such an exception, because So it seems that this was indeed the cause for the failing test case. |
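For context, the success-flag cleanup pattern that the quoted `finally` block belongs to looks roughly like this when written out in full. This is a self-contained sketch: `PartialOpen` and its `Opener` interface are hypothetical, and only the `success` / `doClose()` shape is taken from the snippet above.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader that opens several files and must not leak them on failure.
class PartialOpen {
    /** Assumed stand-in for "open one index file". */
    interface Opener {
        Closeable open() throws IOException;
    }

    final List<Closeable> opened = new ArrayList<>();

    void initialize(List<Opener> files) throws IOException {
        boolean success = false;
        try {
            for (Opener f : files) {
                opened.add(f.open()); // may throw, e.g. FileNotFoundException
            }
            success = true;
        } finally {
            // Close whatever subset was opened instead of waiting for GC.
            if (!success) {
                doClose();
            }
        }
    }

    void doClose() throws IOException {
        for (Closeable c : opened) {
            c.close();
        }
        opened.clear();
    }
}
```

The exception still propagates to the caller after the cleanup, which is why it is easy to overlook while stepping through the `finally` block in a debugger.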
Michael Busch (migrated from JIRA) OK, all tests pass now, including the thread-safety test. Changes:
|
Michael McCandless (@mikemccand) (migrated from JIRA)
No problem, I lost some hairs tracking that one down too!! OK, latest patch looks good! I love the new threaded unit test. Only two smallish comments:
|
Yonik Seeley (@yonik) (migrated from JIRA) So how about a public IndexReader.flush() method so that one could also reopen readers that were used for changes? Use case: reader.deleteDocument() |
Michael Busch (migrated from JIRA)
Yes, will do!
Hmm, what if then in clone.close() an exception is thrown from Hmm but actually we could change the order in close() so that |
Michael Busch (migrated from JIRA)
Since our goal is to make IndexReader read-only in the future |
Michael McCandless (@mikemccand) (migrated from JIRA)
+1 |
Michael McCandless (@mikemccand) (migrated from JIRA)
I think also if we do decide to do this we should open a new issue for it? |
Yonik Seeley (@yonik) (migrated from JIRA) > Since our goal is to make IndexReader read-only in the future flush() would make reopen() useful in more cases, and #2106 is further off (not Lucene 2.3, right?) > I think also if we do decide to do this we should open a new issue for it? Yes, that's fine. |
Michael Busch (migrated from JIRA)
+1 I'll open a new issue. |
Michael Busch (migrated from JIRA) Changes:
|
Michael McCandless (@mikemccand) (migrated from JIRA) Patch looks good. Only thing I found was this leftover System.out.println("refCount " + getRefCount()); |
Michael Busch (migrated from JIRA) Thanks for the review, Mike! I'll remove the println. Ok, I think this patch has been reviewed a bunch of times and |
Michael Busch (migrated from JIRA) Changes:
I'm going to commit this soon! |
Michael Busch (migrated from JIRA) Committed! Phew!!! |
This is Robert Engels' implementation of IndexReader.reopen() functionality, as a set of 3 new classes (this was easier for him to implement, but should probably be folded into the core, if this looks good).
Migrated from LUCENE-743 by Otis Gospodnetic (@otisg), 3 votes, resolved Nov 17 2007
Attachments: IndexReaderUtils.java, lucene-743.patch (versions: 3), lucene-743-take10.patch, lucene-743-take2.patch, lucene-743-take3.patch, lucene-743-take4.patch, lucene-743-take5.patch, lucene-743-take6.patch, lucene-743-take7.patch, lucene-743-take8.patch, lucene-743-take9.patch, MyMultiReader.java, MySegmentReader.java, varient-no-isCloneSupported.BROKEN.patch