Always use bulk-copy when merging stored fields and term vectors [LUCENE-1737] #2811

Closed
asfimport opened this issue Jul 9, 2009 · 6 comments

Comments

@asfimport

asfimport commented Jul 9, 2009

Lucene has nice optimizations in place during merging of stored fields
(#2119) and term vectors (#2197) whereby the bytes are
bulk copied to the new segment. This is much faster than decoding &
rewriting one document at a time.

However, the optimization is rather brittle: it relies on the mapping
of field name to number being the same ("congruent") across the segments
being merged.

Unfortunately, the field mapping will be congruent only if the app
adds the same fields in precisely the same order to each document.
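To illustrate why field order matters, here is a minimal sketch. It is not Lucene code: `CongruenceSketch`, `numberFields`, and `congruent` are hypothetical names, and it assumes numbers are handed out in first-seen order per segment, with "congruent" simplified to "the two name->number maps are equal".

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Lucene.
final class CongruenceSketch {

    // Assign field numbers in first-seen order, the way a segment would
    // if it knew nothing about earlier segments (assumption of this sketch).
    static Map<String, Integer> numberFields(List<List<String>> docs) {
        Map<String, Integer> numbers = new LinkedHashMap<>();
        for (List<String> doc : docs) {
            for (String field : doc) {
                numbers.putIfAbsent(field, numbers.size());
            }
        }
        return numbers;
    }

    // Simplified congruence: both segments map every name to the same number.
    static boolean congruent(Map<String, Integer> a, Map<String, Integer> b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        // Same two fields, added in a different order in each segment.
        Map<String, Integer> seg1 = numberFields(List.of(List.of("title", "body")));
        Map<String, Integer> seg2 = numberFields(List.of(List.of("body", "title")));
        System.out.println(seg1 + " vs " + seg2 + " congruent=" + congruent(seg1, seg2));
    }
}
```

Under these assumptions the two segments number the same fields differently, so a merge between them would have to fall back to decoding and rewriting each document.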

I think we should fix IndexWriter to assign the same field number for
a given field that has been assigned in the past. I.e., when writing a
new segment, we pre-seed the field numbers based on past segments.
All other aspects of FieldInfo would remain fully dynamic.
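A rough sketch of the pre-seeding idea follows. It is not the attached patch: `PreSeededFieldNumbers` and `numberFor` are hypothetical names, and past assignments are assumed to be available as a plain name->number map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the IndexWriter implementation.
final class PreSeededFieldNumbers {
    private final Map<String, Integer> nameToNumber = new LinkedHashMap<>();
    private int nextNumber = 0;

    // Seed the mapping with the numbers that earlier segments already used.
    PreSeededFieldNumbers(Map<String, Integer> pastAssignments) {
        for (Map.Entry<String, Integer> e : pastAssignments.entrySet()) {
            nameToNumber.put(e.getKey(), e.getValue());
            nextNumber = Math.max(nextNumber, e.getValue() + 1);
        }
    }

    // Reuse the past number for a known field; otherwise take the next free one.
    int numberFor(String field) {
        Integer existing = nameToNumber.get(field);
        if (existing != null) {
            return existing;
        }
        int assigned = nextNumber++;
        nameToNumber.put(field, assigned);
        return assigned;
    }

    public static void main(String[] args) {
        Map<String, Integer> past = new LinkedHashMap<>();
        past.put("title", 0);
        past.put("body", 1);
        PreSeededFieldNumbers numbers = new PreSeededFieldNumbers(past);
        // The new segment sees "body" first, but it still keeps its old number.
        System.out.println("body=" + numbers.numberFor("body"));
        System.out.println("author=" + numbers.numberFor("author"));
        System.out.println("title=" + numbers.numberFor("title"));
    }
}
```

With the seed in place, the order in which the app adds fields to a document no longer changes the numbers of fields that earlier segments have already seen; only genuinely new fields get new numbers.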


Migrated from LUCENE-1737 by Michael McCandless (@mikemccand), resolved Dec 19 2010
Attachments: LUCENE-1737.patch (versions: 2)

@asfimport
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

Clearing 2.9 fix version.

@asfimport
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

This turned out to be very simple: a tiny patch!

@asfimport
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

I realized we should fix a few more cases here to use bulk-copy more often. First, on opening a pre-4.0 index, we should sweep all segments to union the FieldInfos so newly written segments are congruent with all past segments as much as possible. Second, when merging we should start from the current FieldInfos.
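The sweep over existing segments could look roughly like the sketch below. This is an assumption about the approach, not the patch itself: `FieldNumberUnion` and `unionFieldNumbers` are hypothetical names, each segment is modeled as a plain name->number map, and the earliest segment wins whenever two segments disagree.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not Lucene's FieldInfos code.
final class FieldNumberUnion {

    // Build one global name->number map from the per-segment maps.
    static Map<String, Integer> unionFieldNumbers(List<Map<String, Integer>> segments) {
        Map<String, Integer> global = new LinkedHashMap<>();
        Set<Integer> usedNumbers = new HashSet<>();
        Set<String> pending = new LinkedHashSet<>();

        for (Map<String, Integer> segment : segments) {
            for (Map.Entry<String, Integer> e : segment.entrySet()) {
                String name = e.getKey();
                if (global.containsKey(name)) {
                    continue; // an earlier segment already fixed this field's number
                }
                if (usedNumbers.add(e.getValue())) {
                    global.put(name, e.getValue()); // number was free, keep it
                    pending.remove(name);
                } else {
                    pending.add(name); // number taken by another field, decide later
                }
            }
        }

        // Fields whose numbers all collided get the lowest unused numbers.
        int next = 0;
        for (String name : pending) {
            while (!usedNumbers.add(next)) {
                next++;
            }
            global.put(name, next);
        }
        return global;
    }

    public static void main(String[] args) {
        Map<String, Integer> global = unionFieldNumbers(List.of(
                Map.of("title", 0, "body", 1),
                Map.of("body", 0, "title", 1, "author", 2)));
        System.out.println(global);
    }
}
```

In this sketch the first segment stays congruent with the union while the second does not, which is the "as much as possible" caveat: segments that already disagree cannot all be made congruent after the fact.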

Even with this, addIndexes(Directory[]) simply copies in new segments, so if the field name->number assignment in those incoming indices doesn't match the current index, those segments can't be bulk copied when they are later merged.

@asfimport
Author

asfimport commented Dec 14, 2010

Michael McCandless (@mikemccand) (migrated from JIRA)

The fixes above can only be done once we always merge doc stores when merging segments, which will be done in #3888.

@asfimport
Author

asfimport commented Dec 14, 2010

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch.

It has one nocommit which we can remove once #3888 is in.

@asfimport
Author

Grant Ingersoll (@gsingers) (migrated from JIRA)

Bulk close for 3.1
