-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
stop writing shared doc stores across segments [LUCENE-2814] #3888
Comments
Earwin Burrfoot (migrated from JIRA) I'll take this. I think. |
Shai Erera (@shaie) (migrated from JIRA) This is great Mike. Does this mean we will be able to merge non consecutive segments? |
Michael McCandless (@mikemccand) (migrated from JIRA)
Well... it's rather decoupled from this issue. Meaning, we were always free to allow non-contiguous merges, and segment would simply have forced the merge of the doc stores in that case. But, yes, because of the same reasons above, such a merge policy is much lower perf hit than it used to be. And so now it is again compelling to explore this.... we just have to go and do #2153. I think it should not be hard; we just have to relax the merge methods in IW that check that the segments are contiguous. This added freedom for the merge policy would be compelling, ie, it can then do a much better job cherry picking similarly sized segments for merging. I think it'd also simplify merge policies like BalancedSegMP, eg (I think?) no longer requiring a viterbi search to pick the best merges. |
Michael McCandless (@mikemccand) (migrated from JIRA) Thanks Earwin :) We should verify the back-compat test has at least one index w/ shared doc stores. DocumentsWriter should get simpler, ie its closeDocStores is now private, and, it should no longer need to call writer.checkpoint() since it will no longer alter the SIS (and no longer build a cfx file) on closing the doc stores. It would be nice to remove the closeDocStore methods from the full indexing chain, since now each component should do that itself, privately, on flush, but maybe we should do that as a 2nd step? |
Earwin Burrfoot (migrated from JIRA)
I believe I've seen some back-compat failures with my quick'n'dirty patch that removed both reading and writing shared docstores. |
Earwin Burrfoot (migrated from JIRA) First iteration. Passes all tests except TestNRTThreads. Something to do with numDocsInStore and numDocsInRam merged together? |
Michael McCandless (@mikemccand) (migrated from JIRA) Same as Earwin's patch just sync'd w/ current trunk. |
Michael McCandless (@mikemccand) (migrated from JIRA) OK I dug here... the reason why TestNRTThreads fails is because you moved the numDocsInRAM++ out of DW.getThreadState into WaitQueue.writeDocument. When we buffer del terms in DW.deleteTerm/Terms/Query/Queries, we grab the current numDocsInRAM as the "docID upto", to record when it comes time to apply the delete which docID we must stop at. But with your change, this value is now an undercount, since numDocsInRAM is now acting like numDocsInStore. One way to fix this would be to change the delete methods to use nextDocID instead of numDocsInRAM? But I think I'd prefer to put back numDocsInRAM++ in getThreadState... |
Michael Busch (migrated from JIRA) The shared doc stores are actually already completely removed in the realtime branch (part of LUCENE-2324). Does someone want to help with the merge, then we can land the realtime branch (which is pretty much only DWPT and removing doc stores) in trunk sometime soon? |
Earwin Burrfoot (migrated from JIRA) Ugh. What is the state for realtime branch? Is it stable? How soon is "sometime soon"? :) |
Michael Busch (migrated from JIRA) Well I need to merge with the recent changes in trunk (especially LUCENE-2680). If I can get most tests to pass again (most were passing before the merge), then I think the only outstanding thing is #3647 before we could land it in trunk. |
Michael McCandless (@mikemccand) (migrated from JIRA) I think taking things one step at a time would be good here? Ie remove doc stores from trunk, let that bake on trunk for a while, then merge to RT? So that what then remains on RT is DWPT / tiered flushing? Else RT is a monster change? |
Michael Busch (migrated from JIRA)
Probably still a smaller change than flex indexing ;) But yeah in general I agree that we should do things more incrementally. I think that's a mistake I've made with the RT branch so far. In this particular case it's just a bit sad to redo all this work now, because I think I got the removal of doc stores right in RT and all related tests to pass. |
Earwin Burrfoot (migrated from JIRA) So, what's the plan? |
Michael Busch (migrated from JIRA)
I can't really work on this much before Saturday. But during the weekend I can work on the RT merge and maybe try to pull out the docstore removal changes and create a separate patch. Have to see how hard that is. If it's not too difficult I'll post a separate patch, otherwise I'll commit the merge to RT and maybe convince you guys to help a bit with getting the RT branch ready for landing in trunk? :) |
Earwin Burrfoot (migrated from JIRA) Instead of you pulling out docstore removal, I can finish that patch. But then merging's gonna be even greater bitch. Probably. But maybe not. |
Earwin Burrfoot (migrated from JIRA) Patch updated to trunk, no nocommits, no *.closeDocStore(), tests pass. SegmentWriteState vs DocumentsWriter bother me. |
Michael McCandless (@mikemccand) (migrated from JIRA)
Yes, true!
Well, not a mistake... this was unavoidable given that trunk was so far from what DWPT needs. But with per-seg deletes (#3754) done, and no more doc stores (this issue), I think we've got DWPT down to about as bite sized as it can be (it's still gonna be big!). I can help merge too! I think coordinating on IRC #lucene is a good idea? It seems like #3647 needs to be incorporated into IW's new FlushControl class (which is already coordinating flush-due-to-deletes and flush-due-to-added-docs-of-one-DWPT). |
Jason Rutherglen (migrated from JIRA)
Indeed!
It'd be nice if there were a log of IRC #lucene, otherwise I prefer Jira.
Right, into the DWPT branch. |
Steven Rowe (@sarowe) (migrated from JIRA)
#lucene-dev is logged. |
Jason Rutherglen (migrated from JIRA) Steven, is that on a wiki page? The usage seems a little slim? http://colabti.org/irclogger/irclogger_log/lucene-dev?date=2010-12-17;raw=on |
Steven Rowe (@sarowe) (migrated from JIRA)
I don't know, I never put it anywhere, just discussed on dev@l.a.o. Feel free to do so if you like.
Yeah, it's very rarely used. Several Lucene people who use #lucene are strongly against logging, so I set up #lucene-dev as a place to host on-the-record Lucene conversations. I mentioned it because this is what you want. |
Earwin Burrfoot (migrated from JIRA) New patch. Now with even more lines removed! DocStore-related index chain components used to track open/closed files through DocumentsWriter.
|
Michael McCandless (@mikemccand) (migrated from JIRA) Patch looks great! Nice work Earwin. I think it's ready to commit. Except, can you resync to trunk? I hit failures applying one hunk to Also, on the nocommit on exc in DW.addDocument, yes I think that Or: we could fix DW.abort to directly call Dir.deleteFile (instead of But, if we end up leaving the delete up above, we should put the I'm not looking forward to the 3.x back port!! |
Earwin Burrfoot (migrated from JIRA) Synced to trunk.
Orphaned files are deleted directly in StoredFieldsWriter.abort() and TermVectorsTermsWriter.abort(). As I said - all the open files tracking is now gone. I have no other lingering cleanup urges, this is ready to be committed. I think. |
Michael McCandless (@mikemccand) (migrated from JIRA) OK I committed to trunk. I'll let this bake for a while on trunk before backporting to 3.x... Thanks Earwin! |
Jason Rutherglen (migrated from JIRA)
Out of curiosity, why are we backporting to 3.x or are we planning on also backporting the DWPT branch? |
Michael McCandless (@mikemccand) (migrated from JIRA)
Well, 3.x is eligible for any new features that don't break back I'd like to see removing shared doc stores backported in particular And I think we should backport DWPT? But first things first, we gotta |
Jason Rutherglen (migrated from JIRA)
Really? It's a fairly large restructuring. Also, I realized that #3388 should be implementable in the single DW case, which perhaps could be backported as well. |
Michael McCandless (@mikemccand) (migrated from JIRA)
It is, but it's all under-the-hood.
Right! |
Michael Busch (migrated from JIRA)
Man, you guys really ruined my Sunday with this commit :) I got so many merge conflicts, that I decided to merge first only up to rev 1050655 (the rev before this commit) and up to HEAD in a second merge. I'm down to 64 compile errors (from 800), hopefully I can finish the merge tomorrow. Just wanted you to know that I'm making progress here with the DWPTs. |
Michael McCandless (@mikemccand) (migrated from JIRA)
Sorry :( Is there any way I can help? |
Michael Busch (migrated from JIRA)
No worries - I'm being overly dramatic :)
Let me try to get it to compile, and then I'll commit. I'm sure a bunch of tests will fail, help would be great then. Also a general review of my changes to IndexWriter/DocumentsWriter/DWPT would be great. I should be able to commit my merge by end of today. |
Grant Ingersoll (@gsingers) (migrated from JIRA) Bulk close for 3.1 |
Shared doc stores enables the files for stored fields and term vectors to be shared across multiple segments. We've had this optimization since 2.1 I think.
It works best against a new index, where you open an IW, add lots of docs, and then close it. In that case all of the written segments will reference slices a single shared doc store segment.
This was a good optimization because it means we never need to merge these files. But, when you open another IW on that index, it writes a new set of doc stores, and then whenever merges take place across doc stores, they must now be merged.
However, since we switched to shared doc stores, there have been two optimizations for merging the stores. First, we now bulk-copy the bytes in these files if the field name/number assignment is "congruent". Second, we now force congruent field name/number mapping in IndexWriter. This means this optimization is much less potent than it used to be.
Furthermore, the optimization adds a lot of hair to IndexWriter/DocumentsWriter; this has been the source of sneaky bugs over time, and causes odd behavior like a merge possibly forcing a flush when it starts. Finally, with DWPT (#3400), which gets us truly concurrent flushing, we can no longer share doc stores.
So, I think we should turn off the write-side of shared doc stores to pave the path for DWPT to land on trunk and simplify IW/DW. We still must support reading them (until 5.0), but the read side is far less hairy.
Migrated from LUCENE-2814 by Michael McCandless (@mikemccand), resolved Jan 05 2011
Attachments: LUCENE-2814.patch (versions: 5)
Linked issues:
The text was updated successfully, but these errors were encountered: