Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor segmentInfos from IndexReader into its subclasses [LUCENE-986] #2062

Closed
asfimport opened this issue Aug 22, 2007 · 11 comments
Closed

Comments

@asfimport
Copy link

asfimport commented Aug 22, 2007

References to segmentInfos in IndexReader cause different kinds of problems
for subclasses of IndexReader, like e. g. MultiReader.

Only subclasses of IndexReader that own the index directory, namely
SegmentReader and MultiSegmentReader, should have a SegmentInfos object
and be able to access it.

Further information:
http://www.gossamer-threads.com/lists/lucene/java-dev/51808
http://www.gossamer-threads.com/lists/lucene/java-user/52460

A part of the refactoring work was already done in #1856


Migrated from LUCENE-986 by Michael Busch, 1 vote, resolved Sep 20 2007
Attachments: lucene-986.patch (versions: 3)
Linked issues:

@asfimport
Copy link
Author

Chris M. Hostetter (@hossman) (migrated from JIRA)

one aspect of this that should be considered: It may not make sense for MultiReader to extend MultiSegmentReader ... as Michael says, only subclasses that own the index directory should have segmentInfos, and a MultiReader (as defined on the trunk now) can never own it's own directory.

I haven't worked through all the implications, but perhaps the most logical refactoring would be...

  • IndexReader
    ...as abstract as possible given that we can't actually make methods abstract
  • DirectoryIndexReader extends IndexReader
    ...new class, encapsulated all the segmentInfos and locking logic currently in
    IndexReader (can definitely be made abstract if helpful)
  • SegmentReader extends DirectoryIndexReader
  • MultiSegmentReader extends DirectoryIndexReader
  • ParallelIndexReader extends IndexReader
  • FilterIndexReader extends IndexReader
  • MultiReader extends IndexReader
    ...(side note that i really haven't thought through completley: should
    MultiReader extend FilterIndexReader?)

there would likely be some utlity functionality that could be reused between MultiReader and MultiSegmentReader ... possible as static methods in IndexReader (or a new util class)

@asfimport
Copy link
Author

Michael Busch (migrated from JIRA)

This is good stuff, Hoss. I like the DirectoryIndexReader idea.
Sounds like a very clean and logical separation.

I'll work through the code to understand all consequences.

@asfimport
Copy link
Author

asfimport commented Aug 30, 2007

Michael Busch (migrated from JIRA)

What do you think about this alternative approach:

  • SegmentReader, MultiSegmentReader and MultiReader all extend IndexReader
  • IndexReader.open() always returns a MultiSegmentReader instance, even
    if the index has only one segment. We can make some optimizations, so that
    e. g. MultiSegmentReader.termDocs() returns a SegmentTermDocs in case
    there's only one underlying SegmentReader.
  • All write lock and transaction logic goes into MultiSegmentReader, as
    it is the only reader that has to obtain a write lock.

The advantage here is that only one class (MultiSegmentReader) is
responsible for acquiring the write lock. It would also make
IndexReader.open() simpler, because it can then return a
MultiSegmentReader instance in all cases. The logic of opening the
different segments could move from IndexReader.open() to the constructor
of MultiSegmentReader. Also #1818 (IndexReader.reopen()) would
become simpler.
The disadvantage of this approach is apparently that always using
a MultiSegmentReader would add some overhead when reading optimized
indexes, but actually I don't think that this overhead would really be
significant.

If we go the DirectoryIndexReader way (where SegmentReader and
MultiSegmentReader both extend DirectoryIndexReader), then we still need
an ownDirectory variable that tells SegmentReader if it has to acquire a
write lock or not, depending on whether it is a subreader of
MultiSegmentReader or not. And of course we add another layer to the
IndexReader hierachy.

So I'm not sure which approach is the better one. I'm hoping to get some
opionions here! Hoss? Others?

@asfimport
Copy link
Author

Chris M. Hostetter (@hossman) (migrated from JIRA)

i rarely remember a week later what i was thinking when i wrote something, but i suspect that when i suggested the DirectoryIndexReader i was assuming it would have everything directory/lock related thta currently exists in the IndexReader base class (including the directoryOwner boolean) ... in cases where there is a single Segment in a directory, there will be SegmentReader with directoryOwner==true ... in the multi segment cases, the MultiSegmentReader will have directoryOwner==true, and it's sub SegmentReaders will all have directoryOwner==false. ...

...the key point of DirectoryIndexReader being that any subclass can own a directory (and automaticly inherits all the code for dealing with locks properly when it needs/wants to) but doesn't always have to own the directory. meanwhile MultiReader (and ParallelIndexReader and FilteredIndexReader) make no attempt at owning a directory, and inherit no code for doing so (or for dealing with the locking of such non existent directories)

I don't really know enough about the performance characteristics of SegmentReader vs a MultiSegmentReader of one segment to have a sense of how possible/practical it would be to eliminate the need for SegmentReader and replace it completely with MultiSegmentReader ...

one hitch might be that SegmentReader.get is public, and in order to keep supporting it, SegmentReader still needs to have/inherit the same segment info and directory owning/locking code that we want to move out of IndexReader (so just putting it MultiSegmentReader won't fly unless we kill that public method)

@asfimport
Copy link
Author

Michael Busch (migrated from JIRA)

Here is the patch with the DirectoryIndexReader approach.

It moves SegmentInfos and Directory from IndexReader into
DirectoryIndexReader, as well as the commit and rollback logic.

MultiSegmentReader and SegmentReader extend DirectoryIndexReader.
MultiReader extends IndexReader now and uses some methods from
MultiSegmentReader that I made static.

I added the method acquireWriteLock() to IndexReader that does
nothing by default. Subclasses must implement it if they require
a write lock (so does DirectoryIndexReader now).

IndexReader is very abstract now and almost all logic moved into
the subclasses. The methods isOptimized(), isCurrent(), and
getVersion() all throw an UnsupportedOperationException (UOE). I
further deprecated the IndexReader constructor that takes a
Directory as argument. As soon as we remove this ctr the method
IndexReader.getDirectory() should throw an UOE as well. I added
a TODO comment as a reminder.

All unit tests pass with this patch.

@asfimport
Copy link
Author

Michael Busch (migrated from JIRA)

> one hitch might be that SegmentReader.get is public, and in order to keep
> supporting it, SegmentReader still needs to have/inherit the same segment info
> and directory owning/locking code that we want to move out of IndexReader (so
> just putting it MultiSegmentReader won't fly unless we kill that public method).

OK, I implemented the DirectoryIndexReader approach. Also because I'm not sure
about the performance characteristics of a MultiSegmentReader acting as a
SegmentReader.

I'd like to commit this rather soon. A review of the patch would be highly
appreciated.

@asfimport
Copy link
Author

Chris M. Hostetter (@hossman) (migrated from JIRA)

Michael: I've been meaning to look at this, but haven't had the time ... your recent update has goaded me :)

just to clarify: the patch you added on September 12th is your latest patch right? ... it's not clear from you comment on the 17th if you intended to attach an update and something went wrong.

I ask because i'm haivng trouble applying the patch from the 12th ... i must be tired because i can't understand why, it doesn't look like the files have changed since you posted the patch, so i'm not sure what it's complaining about ... visually everything seems to match up...

hossman@coaster:~/lucene/lucene$ svn update
At revision 576718.
hossman@coaster:~/lucene/lucene$ svn status
? lucene-986.patch
hossman@coaster:~/lucene/lucene$ patch --dry-run -p0 -i lucene-986.patch patching file src/java/org/apache/lucene/index/DirectoryIndexReader.java
patching file src/java/org/apache/lucene/index/FilterIndexReader.java
patching file src/java/org/apache/lucene/index/IndexReader.java
patching file src/java/org/apache/lucene/index/MultiReader.java
Hunk #1 FAILED at 19.
1 out of 2 hunks FAILED – saving rejects to file src/java/org/apache/lucene/index/MultiReader.java.rej
patching file src/java/org/apache/lucene/index/MultiSegmentReader.java
Hunk #1 FAILED at 30.
Hunk #2 FAILED at 39.
Hunk #3 FAILED at 156.
Hunk #4 FAILED at 171.
Hunk #5 FAILED at 256.
Hunk #6 FAILED at 278.
6 out of 6 hunks FAILED – saving rejects to file src/java/org/apache/lucene/index/MultiSegmentReader.java.rej
patching file src/java/org/apache/lucene/index/ParallelReader.java
patching file src/java/org/apache/lucene/index/SegmentReader.java
Hunk #1 succeeded at 32 with fuzz 2.
patching file src/test/org/apache/lucene/index/TestMultiReader.java
hossman@coaster:~/lucene/lucene$

@asfimport
Copy link
Author

Chris M. Hostetter (@hossman) (migrated from JIRA)

I got the patch to apply cleanly (see mailing list for details) On the whole it looks really good, i'm attaching an updated version with some minor improvements (mainly javadocs), but first a few questions...

  • just to clarify: IndexReader(Directory) is only around for
    backwards compatibility in subclasses right? And the only reason
    for the "private Directory" is to keep supporting the directory()
    method for any subclasses relying on it?

  • IndexReader() says it can be removed once the other constructor is
    removed ... why? is that just pointing out that we can rely on the
    default constructor?

  • since one of the goals is that IndexReaders which don't own their
    directory shouldn't utilize segmentInfos, would it make sense to
    eliminate directoryOwner from DirectoryIndexReader and instead test
    whether (null == segmentInfos) ?

  • the way TestMultiReader is setup with the "mode" is a bit confusing
    ... perhaps instead we should make a subclass where only openReader
    is overwritten (it will inherit the individual test methods) ?

here's the list of tweaks I made...

  • added a note in the IndexReader class javadocs about
    methods that throw UnSupOp but aren't abstract.
  • added javadocs to the IndexReader(Directory) constructor based on my
    understanding
  • added back javadocs to IndexReader methods that now throw UnSupOp to
    make it clear what they do to callers looking at the API, but made
    it clear tthe default impls throw UnSupOp
  • made IndexReader.directory() throw UnSupOp if directory is null,
    enhanced it's javadocs.
  • /* NOOP */ in IndexReader.acquireWriteLock so it's clear to code analysis tools that it's not a mistake.
  • small additions to javadocs for DirectoryIndexReader class, but
    these should probably be elaborated on (depending on what you think
    of my segmentInfos==null idea above)
  • made DirectoryIndexReader(...) call init(...) to eliminate a small
    about of duplication.

@asfimport
Copy link
Author

Michael Busch (migrated from JIRA)

> I got the patch to apply cleanly (see mailing list for details)

Thanks, Hoss! I'm using TortoiseSVN, I have to check how to set those
default properties for new files correctly.

> * just to clarify: IndexReader(Directory) is only around for
> backwards compatibility in subclasses right? And the only reason
> for the "private Directory" is to keep supporting the directory()
> method for any subclasses relying on it?

Yes, correct.

> * IndexReader() says it can be removed once the other constructor is
> removed ... why? is that just pointing out that we can rely on the
> default constructor?

Yes, just a suggested simplification. Keeping the constructor wouldn't hurt
though.

> * since one of the goals is that IndexReaders which don't own their
> directory shouldn't utilize segmentInfos, would it make sense to
> eliminate directoryOwner from DirectoryIndexReader and instead test
> whether (null == segmentInfos) ?

Sounds good, will do...

> * the way TestMultiReader is setup with the "mode" is a bit confusing
> ... perhaps instead we should make a subclass where only openReader
> is overwritten (it will inherit the individual test methods) ?

Yes, that's cleaner, will make that change as well.

> here's the list of tweaks I made...

The improvements look good to me.
Thanks for the review, Hoss! I'll attach a new patch shortly.

@asfimport
Copy link
Author

Michael Busch (migrated from JIRA)

In addition to Hoss' changes this patch:

  • Removes the boolean directoryOwner variable from DirectoryIndexReader
    and checks for segmentInfos != null instead. I also added a comment
    to DirectoryIndexReader about this.

  • TestMultiReader now extends the new unit test TestMultiSegmentReader
    and overwrites the method openReader().

I'm planning to commit this in a day or so if nobody objects.

@asfimport
Copy link
Author

Michael Busch (migrated from JIRA)

Committed Rev. 577596

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant