-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
apply delete-by-Term and docID immediately to newly flushed segments [LUCENE-2897] #3971
Comments
Michael McCandless (@mikemccand) (migrated from JIRA) Initial patch. The approach is nice and simple ;) All tests pass, but I put a bunch of nocommits in there, and I stopped short of the stuff I'll have to redo after committing ooo merges. |
Michael McCandless (@mikemccand) (migrated from JIRA)
Well... I think we can't [easily] skip writing the postings, because could result in non-deterministic behavior (I put a comment on this in the patch). If we did the flush w/ 2 passes (first pass to mark all del docs and 2nd to flush) then we could skip writing postings of docs that were deleted. But I suspect that's too much cost on flush. With a single pass, we'd end up writing some postings for the doc, but not all, depending on the order in which its terms arrived vs its deleted terms. I mean, in practice, an app is gonna delete against ID field (typically) so if we "knew" that down deep here in Luceneland we could do the first pass only against that one field... Also, merge is still going to have to apply del docs, since eg stored fields have written the deleted docs. |
Jason Rutherglen (migrated from JIRA)
Instead we're building the deleted docs BV as we're flushing. |
Michael Busch (migrated from JIRA) Yeah this is nice. I was thinking we'd switch to live deletes with RT, because then we can also handle delete-by-query like this. So the deleted queries we still have to buffer per DWPT, but this solves the updateDocument() problem. |
Michael McCandless (@mikemccand) (migrated from JIRA)
Right, that's what the patch does. Seems to work great :)
Right, though for RT we need to apply the delete immediately (vs this patch which applies on flush). And delete-by-Query is still buffered... |
Michael McCandless (@mikemccand) (migrated from JIRA) Patch – I think it's ready to commit! |
Michael McCandless (@mikemccand) (migrated from JIRA) I think we can do this for 3.1. |
Michael McCandless (@mikemccand) (migrated from JIRA) I changed my mind! Pushing this to 3.2 now. |
Robert Muir (@rmuir) (migrated from JIRA) Bulk closing for 3.2 |
Spinoff from #3400.
When we flush deletes today, we keep them as buffered Term/Query/docIDs that need to be deleted. But, for a newly flushed segment (ie fresh out of the DWPT), this is silly, because during flush we visit all terms and we know their docIDs. So it's more efficient to apply the deletes (for this one segment) at that time.
We still must buffer deletes for all prior segments, but these deletes don't need to map to a docIDUpto anymore; ie we just need a Set.
This issue should wait until #2153 is in since that issue cuts over buffered deletes to a transactional stream.
Migrated from LUCENE-2897 by Michael McCandless (@mikemccand), resolved May 05 2011
Attachments: LUCENE-2897.patch (versions: 2)
The text was updated successfully, but these errors were encountered: