Parallel incremental indexing [LUCENE-1879] #2954
Michael Busch (migrated from JIRA) I have a prototype version, which I implemented at IBM; it works on Lucene 2.4.1. I'm not planning on committing it as is, because it is implemented on top of Lucene's APIs without any core changes and is therefore not as efficient as it could be. The software grant I have lists these files. Shall I attach the tar + MD5 here and send the signed software grant to you, Grant?
Grant Ingersoll (@gsingers) (migrated from JIRA) Yes on the software grant.
Michael Busch (migrated from JIRA) MD5 (parallel_incremental_indexing.tar) = b9a92850ad83c4de2dd2f64db2dcceab

This tar file contains all files listed in the software grant. It is a prototype that works with Lucene 2.4.x only, not with current trunk. Next I'll work on a patch that runs with current trunk.
Michael McCandless (@mikemccand) (migrated from JIRA)

I wonder if we could change Lucene's index format to make this feature […]

Ie, you're having to go to great lengths (since this is built […]

What if we could invert this approach, so that we use only single […]

Whenever a doc is indexed, postings from the fields are then written […]

Could something like this work?
Michael Busch (migrated from JIRA)

I realize the current implementation that's attached here is quite […]

However, I really like its flexibility. You can right now easily […]

#3100 only optimizes a certain use case of the parallel indexing, […]

In other use cases it is certainly desirable to have a parallel index […]

The version of parallel indexing that goes into Lucene's core I […]

You can keep thinking about the whole index as a collection of segments, […]

E.g. the norms could in the future be a parallel segment with a single […]

Things like two-dimensional merge policies will nicely fit into this […]

Different SegmentWriter implementations will allow you to write single […]

So I agree we can achieve updating posting lists the way you describe, […]

What do you think? Of course I don't want to over-complicate all this, […]
Michael McCandless (@mikemccand) (migrated from JIRA)

This sounds great! In fact your proposal for a ParallelSegmentWriter […]

It'd then be a low-level question of whether ParallelSegmentWriter stores […]

This should also fit well into #2532 (flexible indexing) – one […]

Can you elaborate on this? How is addIndexes* term-at-a-time?

Dimension 1 is the docs, and dimension 2 is the assignment of fields […]
Michael Busch (migrated from JIRA)
Right. The goal should be to be able to use this for updating Lucene-internal things (like norms, column-stride fields), but also to give advanced users APIs so that they can partition their data into parallel indexes according to their update requirements (which the current "above Lucene" approach allows).
Exactly! We should also keep the distributed indexing use case in mind here. It could make sense for systems like Katta to not only shard in the document direction.
Sounds great!
Michael Busch (migrated from JIRA)
Let's say we have an index 1 with two fields a and b, and you want to create a new parallel index 2 into which you copy all posting lists of field b. You can achieve this using addDocument(), if you iterate over all posting lists in 1b in parallel and create, for each document in 1, a corresponding document in 2 that contains the terms of the posting lists from 1b that have a posting for the current document. This I called the "document-at-a-time approach". However, this is terribly slow (I tried it out), because of all the posting lists you perform I/O on in parallel. It's far more efficient to copy an entire posting list over from 1b to 2, because then you only perform sequential I/O. And if you use 2.addIndexes(IndexReader(1b)), then exactly this happens, because addIndexes(IndexReader) uses the SegmentMerger to add the index. The SegmentMerger iterates the dictionary and consumes the posting lists sequentially. That's why I called this the "term-at-a-time approach". In my experience, for a use case similar to the one I described here, this is orders of magnitude more efficient: my doc-at-a-time algorithm ran ~20 hours, the term-at-a-time one 8 minutes! The resulting indexes were identical.
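To make the contrast concrete, here is a toy, in-memory sketch of the two copy strategies (this is not Lucene code; the index model and all names are invented for illustration). The term-at-a-time copy walks the dictionary of the field once and moves each posting list wholesale, while the doc-at-a-time copy has to probe every posting list for every document:

```java
import java.util.*;

// Toy model: an inverted index as field -> term -> sorted list of docIDs.
public class SliceCopyDemo {

    // Term-at-a-time: iterate the dictionary once, copying each posting
    // list wholesale -- one sequential pass per list.
    static Map<String, List<Integer>> copyFieldTermAtATime(
            Map<String, Map<String, List<Integer>>> source, String field) {
        Map<String, List<Integer>> out = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : source.get(field).entrySet()) {
            out.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        return out;
    }

    // Doc-at-a-time: for every docID, probe *all* posting lists of the
    // field to reconstruct that document's terms -- many interleaved
    // cursors, which on disk means heavy non-sequential I/O.
    static Map<String, List<Integer>> copyFieldDocAtATime(
            Map<String, Map<String, List<Integer>>> source, String field, int maxDoc) {
        Map<String, List<Integer>> out = new TreeMap<>();
        for (int doc = 0; doc < maxDoc; doc++) {
            for (Map.Entry<String, List<Integer>> e : source.get(field).entrySet()) {
                if (e.getValue().contains(doc)) {
                    out.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(doc);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Map<String, List<Integer>>> index = new HashMap<>();
        index.put("b", new TreeMap<>(Map.of(
                "lucene", List.of(0, 2),
                "search", List.of(1, 2))));
        // Both strategies yield the same target index, as observed in the
        // experiment above; only the access pattern differs.
        System.out.println(copyFieldTermAtATime(index, "b"));
        System.out.println(copyFieldDocAtATime(index, "b", 3));
    }
}
```

The toy version obviously hides the real cost difference, which comes from disk seeks; it only demonstrates that the two traversals produce identical output.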
Michael Busch (migrated from JIRA)
Yes, dimension 1 is unambiguously the docs. Dimension 2 can be the assignment of fields into separate parallel indexes, or also what we today call generations for e.g. the norms files.
Shai Erera (@shaie) (migrated from JIRA) (Warning, this post is long, and is easier to read in JIRA) I've investigated the attached code a lot and I'd like to propose a different design and approach to this whole Parallel Index solution. I'll start by describing the limitations of the current design (whether it's the approach or the code is debatable):

[…]
I'd like to point out that even if the above limitations can be worked around, I still think the Master and Slave notion is not the best approach. At least, I'd like to propose a different approach:

[…]
I realize that accepting only Directory on PW might limit applications that want to pass in their own IW extension, for whatever reason. But other than saying "if you pass in IW and configure it afterwards, it's on your head", I don't think there is any other option ... Well, maybe except if we expose a package-private API for PW to turn off configuration on an IW after it sets it, so that successive calls to the underlying IW's setters will throw an exception ... hmm, might be doable. I'll look into that. If that works, we might want to do the same for the ParallelReader as well.

Michael mentioned a scenario above where one would want to rebuild an index slice. That's still achievable with this design: one would build the IW on the outside and then replace the Directory instance on PW. We'll need to expose such an API as well.

BTW, some of the things I've mentioned can be taken care of in different issues, as follow-on improvements, such as two-level concurrency, supporting a custom MS, etc. I've detailed them here just so we all see the bigger picture that's going on in my head.

I think I wrote all (or most) of the high-level details. I'd like to start implementing this soon. In my head it's all chewed and digested, so I feel I can start implementing today. If possible, I'd like to get this out in 3.1. I'll try to break this issue down into as many issues as I can, to make the contributions containable. We should just keep in mind, for each such issue, the larger picture it solves. I'd appreciate your comments.
Michael McCandless (@mikemccand) (migrated from JIRA) I like the ParallelWriter (index slices) approach!

It sounds quite feasible and more "direct" in how the PW controls each […]

Some of this will require IW to open up some APIs – e.g. making docID […]
Michael Busch (migrated from JIRA) #3400 will be helpful to support multi-threaded parallel indexing. If we have single-threaded DocumentsWriters, then it should be easy to have a ParallelDocumentsWriter?
Shai Erera (@shaie) (migrated from JIRA) The way I planned to support multi-threaded indexing is to do a two-phase addDocument: first, allocate a doc ID from DocumentsWriter (synchronized), and then add the Document to each slice with that doc ID. DocumentsWriter was not supposed to know it is a parallel index ... something like the following:

```java
int docId = obtainDocId();
for (IndexWriter slice : slices) {
  slice.addDocument(docId, document);
}
```

That allows ParallelWriter to really be an orchestrator/manager of all slices, while each slice can be an IW on its own. Now, when you say ParallelDocumentsWriter, I assume you mean that that DocWriter will be aware of the slices? That I think is an interesting idea, which is unrelated to #3400. I.e., ParallelWriter will invoke its addDocument code, which will get down to ParallelDocumentsWriter, which will allocate the doc ID itself and call each slice's DocWriter.addDocument? And then #3400 will just improve the performance of that process? This might require a bigger change to IW than I had anticipated, but perhaps it's worth it. What do you think?
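The two-phase scheme sketched above can be expanded into a small self-contained example (hypothetical classes only; ParallelWriter, Slice, and InMemorySlice are invented names, not Lucene API). The doc ID is allocated exactly once, then every slice indexes under that same ID, which is what keeps the slices aligned regardless of how each one flushes or merges:

```java
import java.util.*;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelWriterSketch {

    // Stand-in for a per-slice IndexWriter that accepts an external docID.
    interface Slice {
        void addDocument(int docId, String doc);
    }

    static class InMemorySlice implements Slice {
        final Map<Integer, String> docs = new TreeMap<>();
        public void addDocument(int docId, String doc) {
            docs.put(docId, doc);
        }
    }

    static class ParallelWriter {
        private final AtomicInteger nextDocId = new AtomicInteger();
        private final List<Slice> slices;

        ParallelWriter(List<Slice> slices) {
            this.slices = slices;
        }

        // Phase 1: allocate the docID once (the only synchronized step).
        // Phase 2: each slice indexes its part of the document under
        // that same docID, so all slices stay in sync at the docID level.
        void addDocument(List<String> fieldsPerSlice) {
            int docId = nextDocId.getAndIncrement();
            for (int i = 0; i < slices.size(); i++) {
                slices.get(i).addDocument(docId, fieldsPerSlice.get(i));
            }
        }
    }

    public static void main(String[] args) {
        InMemorySlice content = new InMemorySlice();
        InMemorySlice ids = new InMemorySlice();
        ParallelWriter pw = new ParallelWriter(List.of(content, ids));
        pw.addDocument(List.of("some text", "doc-1"));
        pw.addDocument(List.of("more text", "doc-2"));
        // Both slices hold the same docIDs.
        System.out.println(content.docs.keySet());
        System.out.println(ids.docs.keySet());
    }
}
```

In a real implementation the per-slice addDocument calls could run concurrently once the ID is allocated, since only the ID counter needs coordination.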
Grant Ingersoll (@gsingers) (migrated from JIRA) First off, I haven't looked at the code here or the comments beyond skimming, but this is something I've had in my head for a long time, though I don't have any code. When I think about the whole update problem, I keep coming back to the notion of Photoshop layers that essentially mask the underlying part of the photo, without damaging it. The analogy isn't quite the same here, but nevertheless... This leads me to wonder whether the solution isn't best achieved at the index level rather than at the Reader/Writer level. So, thinking out loud here, and I'm not sure on the best wording of this: […] On the search side, I think performance would still be maintained, because even in high-update environments you aren't usually talking about more than a few thousand changes in a minute or two, and the background merger would be responsible for keeping the total number of disjoint documents low.
Shai Erera (@shaie) (migrated from JIRA) Hi Grant - I believe what you describe relates to the incremental field updates problem, where someone might want to change the value of a specific document's field. But PI is not about that. Rather, PI is about updating a whole slice at once, i.e., changing a field's value across all docs, or adding a field to all docs (I believe such a question was asked on the user list a few days ago). I've listed above several scenarios where PI is useful, but unfortunately it is unrelated to incremental field updates. If I misunderstood you, then please clarify. Re incremental field updates, I think your direction is interesting and deserves discussion, but in a separate issue/thread?
Grant Ingersoll (@gsingers) (migrated from JIRA) Thanks, Shai, I had indeed misread the intent, and was likely further confused due to the fact that Michael B and I discussed it over tasty Belgian beer in Oakland. I'll open a discussion on the list for incremental field updates.
Michael Busch (migrated from JIRA)
FWIW: The attached code and approach were never meant to be committed. I attached the code for legal reasons, as it contains the IP that IBM donated to Apache via the software grant, and Apache requires attaching the code covered by such a grant. I wouldn't want the master/slave approach in Lucene core; it can be implemented much more cleanly inside Lucene. The attached code, however, was developed under the requirement of running on top of an unmodified Lucene version.

The code runs without exceptions with Lucene 2.4. It doesn't work with 2.9/3.0, but you'll find an upgraded version that works with 3.0 within IBM, Shai.
Shai Erera (@shaie) (migrated from JIRA) I have found such a version ... and it fails too :). At least the one I received. But never mind that ... as long as we both agree the implementation should change. I didn't mean to say anything bad about what you did ... I know the limitations you had to work with.
An Hong Yun (migrated from JIRA) Hi Michael, is there any latest progress on this topic? I am very interested in this!
Eks Dev (migrated from JIRA) The user mentioned in the comment above was me, I guess. Commenting here just to add an interesting use case that would be perfectly solved by this issue. Imagine a Solr master/slave setup where the full document contains CONTENT and ID fields, e.g. a 200M+ document collection. On the master, we need the ID field indexed in order to process delete/update commands. On the slaves, we do not need lookup on ID and would like to keep our terms dictionary small, without exploding it with 200M+ unique ID terms (ouch, this is a lot compared to 5M unique terms in CONTENT, with or without pulsing). With this issue, this could be natively achieved by modifying the Solr UpdateHandler not to transfer the "ID index" to slaves at all. There are other ways to fix it, but this would be the best. (I am currently investigating an option to transfer the full index on update, but to filter out the terms dictionary at the IndexReader level: it remains on disk, but that part never gets accessed on the slaves. I do not know yet whether this is possible at all in general, e.g. once an FST-based term dictionary is already built; a prefix-compressed term dictionary would be doable.)
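As an illustration of the setup Eks Dev describes (all names and structures here are invented for the sketch; this is not Solr or Lucene code): if CONTENT and ID live in separate parallel slices, replication can simply omit the ID slice, so slaves never carry the unique ID terms at all:

```java
import java.util.*;

public class FieldSliceRouting {

    // Split documents into per-field slices at index time.
    // Each docs entry is {content, id}; docIDs are positional.
    static Map<String, Map<Integer, String>> buildSlices(List<String[]> docs) {
        Map<Integer, String> content = new TreeMap<>();
        Map<Integer, String> id = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            content.put(docId, docs.get(docId)[0]);
            id.put(docId, docs.get(docId)[1]);
        }
        Map<String, Map<Integer, String>> slices = new HashMap<>();
        slices.put("CONTENT", content);
        slices.put("ID", id);
        return slices;
    }

    // Replication to a slave ships only the named slices; the ID slice
    // stays on the master, where it is needed for delete/update commands.
    static Map<String, Map<Integer, String>> replicate(
            Map<String, Map<Integer, String>> slices, Set<String> wanted) {
        Map<String, Map<Integer, String>> out = new HashMap<>();
        for (String name : wanted) {
            out.put(name, slices.get(name));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, String>> master = buildSlices(List.of(
                new String[]{"some content", "doc-1"},
                new String[]{"other content", "doc-2"}));
        Map<String, Map<Integer, String>> slave =
                replicate(master, Set.of("CONTENT"));
        System.out.println(slave.keySet());
    }
}
```

Because both slices share the same docIDs, the master can still resolve an ID to a docID for deletes, and the delete can then be applied to every replicated slice by docID.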
Steven Rowe (@sarowe) (migrated from JIRA) Bulk move 4.4 issues to 4.5 and 5.0 |
A new feature that allows building parallel indexes and keeping them in sync on a docID level, independent of the choice of the MergePolicy/MergeScheduler.
Find details on the wiki page for this feature:
http://wiki.apache.org/lucene-java/ParallelIncrementalIndexing
Discussion on java-dev:
http://markmail.org/thread/ql3oxzkob7aqf3jd
Migrated from LUCENE-1879 by Michael Busch, 5 votes, updated May 09 2016
Attachments: parallel_incremental_indexing.tar
Linked issues: