Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stop writing shared doc stores across segments [LUCENE-2814] #3888

Closed
asfimport opened this issue Dec 13, 2010 · 34 comments
Closed

stop writing shared doc stores across segments [LUCENE-2814] #3888

asfimport opened this issue Dec 13, 2010 · 34 comments

Comments

@asfimport
Copy link

asfimport commented Dec 13, 2010

Shared doc stores enables the files for stored fields and term vectors to be shared across multiple segments. We've had this optimization since 2.1 I think.

It works best against a new index, where you open an IW, add lots of docs, and then close it. In that case all of the written segments will reference slices a single shared doc store segment.

This was a good optimization because it means we never need to merge these files. But, when you open another IW on that index, it writes a new set of doc stores, and then whenever merges take place across doc stores, they must now be merged.

However, since we switched to shared doc stores, there have been two optimizations for merging the stores. First, we now bulk-copy the bytes in these files if the field name/number assignment is "congruent". Second, we now force congruent field name/number mapping in IndexWriter. This means this optimization is much less potent than it used to be.

Furthermore, the optimization adds a lot of hair to IndexWriter/DocumentsWriter; this has been the source of sneaky bugs over time, and causes odd behavior like a merge possibly forcing a flush when it starts. Finally, with DWPT (#3400), which gets us truly concurrent flushing, we can no longer share doc stores.

So, I think we should turn off the write-side of shared doc stores to pave the path for DWPT to land on trunk and simplify IW/DW. We still must support reading them (until 5.0), but the read side is far less hairy.


Migrated from LUCENE-2814 by Michael McCandless (@mikemccand), resolved Jan 05 2011
Attachments: LUCENE-2814.patch (versions: 5)
Linked issues:

@asfimport
Copy link
Author

Earwin Burrfoot (migrated from JIRA)

I'll take this. I think.

@asfimport
Copy link
Author

Shai Erera (@shaie) (migrated from JIRA)

This is great Mike. Does this mean we will be able to merge non consecutive segments?

@asfimport
Copy link
Author

asfimport commented Dec 14, 2010

Michael McCandless (@mikemccand) (migrated from JIRA)

Does this mean we will be able to merge non consecutive segments?

Well... it's rather decoupled from this issue. Meaning, we were always free to allow non-contiguous merges, and segment would simply have forced the merge of the doc stores in that case.

But, yes, because of the same reasons above, such a merge policy is much lower perf hit than it used to be. And so now it is again compelling to explore this.... we just have to go and do #2153. I think it should not be hard; we just have to relax the merge methods in IW that check that the segments are contiguous.

This added freedom for the merge policy would be compelling, ie, it can then do a much better job cherry picking similarly sized segments for merging. I think it'd also simplify merge policies like BalancedSegMP, eg (I think?) no longer requiring a viterbi search to pick the best merges.

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks Earwin :)

We should verify the back-compat test has at least one index w/ shared doc stores.

DocumentsWriter should get simpler, ie its closeDocStores is now private, and, it should no longer need to call writer.checkpoint() since it will no longer alter the SIS (and no longer build a cfx file) on closing the doc stores.

It would be nice to remove the closeDocStore methods from the full indexing chain, since now each component should do that itself, privately, on flush, but maybe we should do that as a 2nd step?

@asfimport
Copy link
Author

Earwin Burrfoot (migrated from JIRA)

We should verify the back-compat test has at least one index w/ shared doc stores.

I believe I've seen some back-compat failures with my quick'n'dirty patch that removed both reading and writing shared docstores.
So it should be ok.

@asfimport
Copy link
Author

Earwin Burrfoot (migrated from JIRA)

First iteration.

Passes all tests except TestNRTThreads. Something to do with numDocsInStore and numDocsInRam merged together?
Lots of non-critical nocommits (just markers for places I'd like to recheck).
DW.docStoreEnabled and *.closeDocStore() have to go, before committing

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

Same as Earwin's patch just sync'd w/ current trunk.

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

OK I dug here... the reason why TestNRTThreads fails is because you moved the numDocsInRAM++ out of DW.getThreadState into WaitQueue.writeDocument.

When we buffer del terms in DW.deleteTerm/Terms/Query/Queries, we grab the current numDocsInRAM as the "docID upto", to record when it comes time to apply the delete which docID we must stop at.

But with your change, this value is now an undercount, since numDocsInRAM is now acting like numDocsInStore.

One way to fix this would be to change the delete methods to use nextDocID instead of numDocsInRAM?

But I think I'd prefer to put back numDocsInRAM++ in getThreadState...

@asfimport
Copy link
Author

Michael Busch (migrated from JIRA)

The shared doc stores are actually already completely removed in the realtime branch (part of LUCENE-2324).

Does someone want to help with the merge, then we can land the realtime branch (which is pretty much only DWPT and removing doc stores) in trunk sometime soon?

@asfimport
Copy link
Author

Earwin Burrfoot (migrated from JIRA)

Ugh.
On first glance @ realtime branch, my patch is not a strict subset. But most stuff, including things I'm struggling with now, is definetly in.
So, I guess, I'll stop wasting effort.

What is the state for realtime branch? Is it stable? How soon is "sometime soon"? :)

@asfimport
Copy link
Author

asfimport commented Dec 16, 2010

Michael Busch (migrated from JIRA)

Well I need to merge with the recent changes in trunk (especially LUCENE-2680).
The merge is pretty hard, but I'm planning to spend most of my weekend on it.

If I can get most tests to pass again (most were passing before the merge), then I think the only outstanding thing is #3647 before we could land it in trunk.

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

I think taking things one step at a time would be good here?

Ie remove doc stores from trunk, let that bake on trunk for a while, then merge to RT? So that what then remains on RT is DWPT / tiered flushing? Else RT is a monster change?

@asfimport
Copy link
Author

Michael Busch (migrated from JIRA)

I think taking things one step at a time would be good here?

Probably still a smaller change than flex indexing ;)

But yeah in general I agree that we should do things more incrementally. I think that's a mistake I've made with the RT branch so far. In this particular case it's just a bit sad to redo all this work now, because I think I got the removal of doc stores right in RT and all related tests to pass.

@asfimport
Copy link
Author

Earwin Burrfoot (migrated from JIRA)

So, what's the plan?

@asfimport
Copy link
Author

Michael Busch (migrated from JIRA)

So, what's the plan?

I can't really work on this much before Saturday. But during the weekend I can work on the RT merge and maybe try to pull out the docstore removal changes and create a separate patch. Have to see how hard that is. If it's not too difficult I'll post a separate patch, otherwise I'll commit the merge to RT and maybe convince you guys to help a bit with getting the RT branch ready for landing in trunk? :)

@asfimport
Copy link
Author

Earwin Burrfoot (migrated from JIRA)

Instead of you pulling out docstore removal, I can finish that patch. But then merging's gonna be even greater bitch. Probably. But maybe not.
Do you do IRC? It can be faster to discuss in realtime, and you could also tell what help you need with the branch.

@asfimport
Copy link
Author

Earwin Burrfoot (migrated from JIRA)

Patch updated to trunk, no nocommits, no *.closeDocStore(), tests pass.

SegmentWriteState vs DocumentsWriter bother me.
We track flushed files in both, we inconsistently get current segment from both of them.

@asfimport
Copy link
Author

asfimport commented Dec 17, 2010

Michael McCandless (@mikemccand) (migrated from JIRA)

Probably still a smaller change than flex indexing

Yes, true!

But yeah in general I agree that we should do things more incrementally. I think that's a mistake I've made with the RT branch so far.

Well, not a mistake... this was unavoidable given that trunk was so far from what DWPT needs. But with per-seg deletes (#3754) done, and no more doc stores (this issue), I think we've got DWPT down to about as bite sized as it can be (it's still gonna be big!).

I can help merge too! I think coordinating on IRC #lucene is a good idea? It seems like #3647 needs to be incorporated into IW's new FlushControl class (which is already coordinating flush-due-to-deletes and flush-due-to-added-docs-of-one-DWPT).

@asfimport
Copy link
Author

asfimport commented Dec 17, 2010

Jason Rutherglen (migrated from JIRA)

I think we've got DWPT down to about as bite sized as it can be (it's still gonna be big!)

Indeed!

I think coordinating on IRC #lucene is a good idea?

It'd be nice if there were a log of IRC #lucene, otherwise I prefer Jira.

It seems like #3647 needs to be incorporated into IW's new FlushControl class

Right, into the DWPT branch.

@asfimport
Copy link
Author

Steven Rowe (@sarowe) (migrated from JIRA)

I think coordinating on IRC #lucene is a good idea?

It'd be nice if there were a log of IRC #lucene, otherwise I prefer Jira.

#lucene-dev is logged.

@asfimport
Copy link
Author

Jason Rutherglen (migrated from JIRA)

Steven, is that on a wiki page?

The usage seems a little slim? http://colabti.org/irclogger/irclogger_log/lucene-dev?date=2010-12-17;raw=on

@asfimport
Copy link
Author

Steven Rowe (@sarowe) (migrated from JIRA)

Steven, is that on a wiki page?

I don't know, I never put it anywhere, just discussed on dev@l.a.o. Feel free to do so if you like.

The usage seems a little slim? http://colabti.org/irclogger/irclogger_log/lucene-dev?date=2010-12-17;raw=on

Yeah, it's very rarely used.

Several Lucene people who use #lucene are strongly against logging, so I set up #lucene-dev as a place to host on-the-record Lucene conversations. I mentioned it because this is what you want.

@asfimport
Copy link
Author

Earwin Burrfoot (migrated from JIRA)

New patch. Now with even more lines removed!

DocStore-related index chain components used to track open/closed files through DocumentsWriter.
Closed files list was unused, and is silently gone.
Open files list was used to:

  • prevent not-yet-flushed shared docstores from being deleted by IndexFileDeleter.
    • no shared docstores, no need + IFD no longer requires a reference to DW
  • delete already opened docstore files, when aborting.
    • index chain now handles this on its own + has cleaner error handling code.

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch looks great! Nice work Earwin. I think it's ready to commit.

Except, can you resync to trunk? I hit failures applying one hunk to
DW.java.

Also, on the nocommit on exc in DW.addDocument, yes I think that
(IFD.deleteNewFiles, not checkpoint) is still needed because DW can
orphan the store files on abort?

Or: we could fix DW.abort to directly call Dir.deleteFile (instead of
relying on IFD.deleteNewFiles). Ie, w/ no shared doc stores, these
files should never have been registered w/ IFD so they can be
privately managed by DW.

But, if we end up leaving the delete up above, we should put the
docWriter null check back so silly apps that close IW while still
indexing don't get NPEs.

I'm not looking forward to the 3.x back port!!

@asfimport
Copy link
Author

Earwin Burrfoot (migrated from JIRA)

Synced to trunk.

Also, on the nocommit on exc in DW.addDocument, yes I think that (IFD.deleteNewFiles, not checkpoint) is still needed because DW can orphan the store files on abort?

Orphaned files are deleted directly in StoredFieldsWriter.abort() and TermVectorsTermsWriter.abort(). As I said - all the open files tracking is now gone.
Turns out checkpoint() is also no longer needed.

I have no other lingering cleanup urges, this is ready to be committed. I think.

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

OK I committed to trunk. I'll let this bake for a while on trunk before backporting to 3.x...

Thanks Earwin!

@asfimport
Copy link
Author

Jason Rutherglen (migrated from JIRA)

backporting to 3.x...

Out of curiosity, why are we backporting to 3.x or are we planning on also backporting the DWPT branch?

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

Out of curiosity, why are we backporting to 3.x or are we planning on also backporting the DWPT branch?

Well, 3.x is eligible for any new features that don't break back
compat. It's a feature branch not a bug-fix branch.

I'd like to see removing shared doc stores backported in particular
because it's a good simplification of the code, ie, it can prevent
sneaky bugs.

And I think we should backport DWPT? But first things first, we gotta
land it on trunk :) And let it bake some... then backport.

@asfimport
Copy link
Author

asfimport commented Dec 19, 2010

Jason Rutherglen (migrated from JIRA)

And I think we should backport DWPT?

Really? It's a fairly large restructuring.

Also, I realized that #3388 should be implementable in the single DW case, which perhaps could be backported as well.

@asfimport
Copy link
Author

asfimport commented Dec 19, 2010

Michael McCandless (@mikemccand) (migrated from JIRA)

Really? It's a fairly large restructuring.

It is, but it's all under-the-hood.

Also, I realized that #3388 should be implementable in the single DW case, which perhaps could be backported as well.

Right!

@asfimport
Copy link
Author

Michael Busch (migrated from JIRA)

OK I committed to trunk. I'll let this bake for a while on trunk before backporting to 3.x...
Thanks Earwin!

Man, you guys really ruined my Sunday with this commit :)

I got so many merge conflicts, that I decided to merge first only up to rev 1050655 (the rev before this commit) and up to HEAD in a second merge.

I'm down to 64 compile errors (from 800), hopefully I can finish the merge tomorrow. Just wanted you to know that I'm making progress here with the DWPTs.

@asfimport
Copy link
Author

Michael McCandless (@mikemccand) (migrated from JIRA)

Man, you guys really ruined my Sunday with this commit

Sorry :(

Is there any way I can help?

@asfimport
Copy link
Author

Michael Busch (migrated from JIRA)

Sorry

No worries - I'm being overly dramatic :)

Is there any way I can help?

Let me try to get it to compile, and then I'll commit. I'm sure a bunch of tests will fail, help would be great then. Also a general review of my changes to IndexWriter/DocumentsWriter/DWPT would be great. I should be able to commit my merge by end of today.

@asfimport
Copy link
Author

Grant Ingersoll (@gsingers) (migrated from JIRA)

Bulk close for 3.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment