Per thread DocumentsWriters that write their own private segments [LUCENE-2324] #3400
Comments
Michael Busch (migrated from JIRA) Here is an interesting article about allocation/deallocation on modern JVMs: And here is a snippet that mentions how pooling is generally not faster anymore: Allocation in JVMs was not always so fast – early JVMs indeed had poor allocation and garbage collection performance, which is almost certainly where this myth got started. In the very early days, we saw a lot of "allocation is slow" advice – because it was, along with everything else in early JVMs – and performance gurus advocated various tricks to avoid allocation, such as object pooling. (Public service announcement: Object pooling is now a serious performance loss for all but the most heavyweight of objects, and even then it is tricky to get right without introducing concurrency bottlenecks.) However, a lot has happened since the JDK 1.0 days; the introduction of generational collectors in JDK 1.2 has enabled a much simpler approach to allocation, greatly improving performance.
Michael McCandless (@mikemccand) (migrated from JIRA) Sounds great – let's test it in practice.
Michael Busch (migrated from JIRA) Reply to Mike's comment on #3369: https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12845263&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12845263
> Hmm I think we'd need a separate hash. Otherwise you have to subclass PostingList for the different cases (frequent vs. non-frequent terms) and do instanceof checks? Or with the parallel arrays idea maybe we could encode more information in the dense ID? E.g. use one bit to indicate if that term occurred more than once.
Yeah I like that idea. I've done something similar for representing trees - I had a very compact Node class with no data but such a dense ID, and arrays that stored the associated data. Very easy to add another data type with no RAM overhead (you only use the amount of RAM the data needs). Though, the price you pay is dereferencing multiple times, once for each array? Seems like it's 8 bytes: http://www.codeinstructions.com/2008/12/java-objects-memory-structure.html So in a 32-bit JVM we would save 4 bytes (pointer) + 8 bytes (header) - 4 bytes (ID) = 8 bytes. For fields with tons of unique terms that might be worth it?
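To make the dense-ID / parallel-arrays idea a bit more concrete, here is a minimal sketch (class and field names are made up for illustration; this is not the actual TermsHash code):

```java
// Hypothetical sketch only: one dense int termID per unique term, indexing into
// parallel primitive arrays, instead of allocating a PostingList object per term.
class ParallelPostingsSketch {
  int[] docFreqs;     // docFreq per termID
  int[] lastDocIDs;   // last docID seen per termID
  int[] byteStarts;   // start of the term's byte slice per termID

  ParallelPostingsSketch(int size) {
    docFreqs = new int[size];
    lastDocIDs = new int[size];
    byteStarts = new int[size];
  }

  // Per term this costs only the data itself (3 ints = 12 bytes here); there is no
  // per-term object header (8 bytes) and no pointer to it (4 bytes on a 32-bit JVM),
  // which is where the "save 8 bytes per term" estimate above comes from.
  void grow(int newSize) {
    docFreqs = java.util.Arrays.copyOf(docFreqs, newSize);
    lastDocIDs = java.util.Arrays.copyOf(lastDocIDs, newSize);
    byteStarts = java.util.Arrays.copyOf(byteStarts, newSize);
  }
}
```

A spare bit of the dense termID could then flag singleton terms, as suggested above, without adding any per-term objects.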
Michael Busch (migrated from JIRA)
I have to admit that I need to catch up a bit on the flex branch. I was wondering if it makes sense to run these kinds of experiments (pooling vs. non-pooling) with the flex code? Is it as fast as trunk already, or are there related nocommits left that affect indexing performance? I would think not much of the flex changes should affect the in-memory indexing performance (in TermsHash*).
Earwin Burrfoot (migrated from JIRA) > Seems like it's 8 bytes Also, GC time is (roughly) linear in the number of objects on the heap, so replacing a single huge array of objects with a few huge primitive arrays for their fields does miracles for your GC delays.
Michael McCandless (@mikemccand) (migrated from JIRA)
Or 2 sets of parallel arrays (one for the singletons).... or, something.
And also the GC cost. But it seems like specializing singleton fields will be the bigger win.
Last I tested (a while back now) indexing perf was the same – need to
And most of this change (changing how postings data is buffered in
But if for some reason you need to start changing index postings
Michael McCandless (@mikemccand) (migrated from JIRA)
Right, and the pointer'd also be 8 bytes (but compact int stays at 4
Another thing we could do if we cutover to parallel arrays is to
Jason Rutherglen (migrated from JIRA) Carrying over from #3388. I'm proposing that, for starters, we have a byte slice writer: lock, move or copy(?) the bytes from the writable byte pool/writer to a read-only byte block pool, unlock. This sounds like a fairly self-contained thing that can be unit tested at a low level. Mike, can you add a bit as to how this could work? Also, what is the IntBlockPool used for?
Jason Rutherglen (migrated from JIRA) Are there going to be issues with the char array buffers as well (ie, will we need to also flush them for concurrency?)
Michael Busch (migrated from JIRA) Shall we not first try to remove the downstream *PerThread classes and make the DocumentsWriter single-threaded without locking? Then we add a PerThreadDocumentsWriter and a DocumentsWriterThreadBinder, which talks to the PerThreadDWs; IW talks to the DWTB. We can pick other names :) When that's done we can think about what kind of locking/synchronization/volatile stuff we need for #3388.
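For illustration only, the split being proposed could look roughly like this (a sketch; the class names follow the comment above, while the method names and the use of ConcurrentHashMap are my assumptions, not the committed design):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.lucene.document.Document;

// Sketch only: each indexing thread gets its own single-threaded writer that owns a
// private in-RAM segment; IndexWriter talks to the binder, and the binder picks the writer.
class PerThreadDocumentsWriter {
  void addDocument(Document doc) {
    // buffer postings into this thread's private segment; no locking needed in here
  }

  void flush() {
    // write this thread's private segment to disk, independently of the other threads
  }
}

class DocumentsWriterThreadBinder {
  private final Map<Thread, PerThreadDocumentsWriter> writers = new ConcurrentHashMap<>();

  PerThreadDocumentsWriter bind() {
    // bind the calling thread to its private writer, creating one on first use
    return writers.computeIfAbsent(Thread.currentThread(), t -> new PerThreadDocumentsWriter());
  }
}
```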
Jason Rutherglen (migrated from JIRA) Michael, For #3388, I think the searching isn't going to be an
However if reworking the PerThread classes somehow makes the tie
Jason Rutherglen (migrated from JIRA) NormsWriterPerField has a growing norm byte array; we'd need a way to read/write lock it... I think we have concurrency issues in the TermsHash table? Maybe it'd need to be rewritten to use ConcurrentHashMap?
Jason Rutherglen (migrated from JIRA) Actually TermsHashField doesn't need to be concurrent: it's only being written to, and the terms concurrent skiplist (was a btree) holds the reference to the posting list. So I think we're good there because the terms enum never accesses the terms hash. Nice!
Michael Busch (migrated from JIRA) I think we all agree that we want to have a single-writer-thread, multi-reader-thread model. Only then can the thread-safety problems in #3388 be reduced to visibility (no write-locking). So I think making this change first makes the most sense. It involves a bit of boring refactoring work unfortunately.
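Just to illustrate what "reduced to visibility" means in a single-writer / multi-reader model (this is not code from the branch, only a generic sketch): the one writer thread publishes its state through a volatile reference, and reader threads never need locks.

```java
// Sketch: the single writer publishes immutable snapshots via a volatile reference;
// reader threads only need visibility of the latest publish, no write-locking at all.
class SingleWriterPublication {
  static final class Snapshot {
    final int maxDocID;            // whatever per-segment state readers need to see
    Snapshot(int maxDocID) { this.maxDocID = maxDocID; }
  }

  private volatile Snapshot current = new Snapshot(0);

  // Called only by the single indexing thread.
  void publish(int newMaxDocID) {
    current = new Snapshot(newMaxDocID);  // volatile write = safe publication
  }

  // Called by any number of searcher threads, without locks.
  Snapshot acquire() {
    return current;                       // volatile read sees the latest publish
  }
}
```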
Jason Rutherglen (migrated from JIRA) Michael, Agreed, can you outline how you think we should proceed then?
Michael Busch (migrated from JIRA)
Sorry for not responding earlier... I'm currently working on removing the PostingList object pooling, because that makes TermsHash and TermsHashPerThread much simpler. I have written the patch and all tests pass, though I haven't done performance testing yet. Making TermsHash and TermsHashPerThread smaller will also make the patch here easier, since that patch will remove them. I'll post the patch soon. The next steps here, I think, are to make everything downstream of DocumentsWriter single-threaded (removal of the *PerThread classes). Then we need to write the DocumentsWriterThreadBinder and have to think about how to apply deletes, commits and rollbacks to all DocumentsWriter instances.
Michael Busch (migrated from JIRA) All tests pass, but I have to review whether the memory consumption calculation still works correctly with the changes. Not sure if the junits test that? Also haven't done any performance testing yet.
Jason Rutherglen (migrated from JIRA) Michael, I'm guessing this patch needs to be updated as per #3405?
Jason Rutherglen (migrated from JIRA) Actually, I just browsed the patch again; I don't think it implements private doc writers as of yet? I think you're right, we can get this issue completed. LUCENE-2312's path looks clear at this point. Shall I take a whack at it?
Michael Busch (migrated from JIRA) Hey Jason, Disregard my patch here. I just experimented with removal of pooling, but then did #3405 instead. TermsHash and TermsHashPerThread are now much simpler, because all the pooling code is gone after 2329 was committed. That should make it a little easier to get this patch done. Sure, it'd be awesome if you could provide a patch here. I can help you; we should just frequently post patches here so that we don't both work on the same areas.
Jason Rutherglen (migrated from JIRA) Michael, I'm working on a patch and will post one (hopefully) shortly.
Michael Busch (migrated from JIRA) Awesome!
Jason Rutherglen (migrated from JIRA) I'm a little confused by the flushedDocCount and remap-deletes portions of DocWriter. flushedDocCount is used as a global counter; however, when we move to per-thread doc writers, it won't be global anymore. Is there a different (easier) way to perform remap deletes?
Michael McCandless (@mikemccand) (migrated from JIRA) Good question... When we buffer a delete Term/Query we record the current docID as of when that delete had arrived (so that interleaved deletes/adds are resolved properly). The docID we record is "absolute" (ie, it adds in the base flushedDocCount), so that we can decouple when deletes are materialized (moved into the deletedDocs BitVectors) from when new segments are flushed. I think we have a couple of options. Option 1 is to use a relative (within the current segment) docID when the deleted Term/Query/docID is first buffered, but then make it absolute only when the segment is finally flushed. Option 2 is to use a relative docID, but do away with the decoupling, ie force deletions to always flush at the same time the segment is flushed. I think I like option 1 the best – I suspect the decoupling gains us performance as it allows us to batch up more deletions (doing deletions in batch gets better locality, and also means opening/closing readers less often, in the non-pooling case).
Jason Rutherglen (migrated from JIRA) Mike, let's do option 1. I think the process of making the doc id absolute is simply adding up the previous segments' doc counts to get the base? Option 2 would use reader cloning?
Michael McCandless (@mikemccand) (migrated from JIRA)
Right.
I don't think so – I think it'd have to pull a SegmentReader for every segment every time we flush a new segment, to resolve the deletions. In the non-pooled case that'd be a newly opened SegmentReader for every segment in the index every time a new segment is flushed.
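A minimal sketch of option 1, just to pin down the bookkeeping (illustrative names only, not the actual DocumentsWriter code): deletes are buffered against segment-relative docIDs and rebased to absolute docIDs at flush, once the base is known.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: per-DWPT buffered deletes keyed by term, recording the docID *within the
// current in-RAM segment* up to which the delete applies.
class BufferedDeletesSketch {
  private final Map<String, Integer> deleteTerms = new HashMap<>();

  void bufferDeleteTerm(String term, int relativeDocIDUpto) {
    deleteTerms.put(term, relativeDocIDUpto);
  }

  // At flush time the base is simply the sum of the doc counts of previously flushed
  // segments, so the relative docIDs can be made absolute.
  Map<String, Integer> toAbsolute(int flushedDocCountBase) {
    Map<String, Integer> absolute = new HashMap<>();
    deleteTerms.forEach((term, relDocID) -> absolute.put(term, flushedDocCountBase + relDocID));
    return absolute;
  }
}
```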
Jason Rutherglen (migrated from JIRA) Currently the doc writer manages the ram buffer size, however
A slightly different memory management needs to be designed.
With this issue, the flush logic probably needs to be bumped up
Jason Rutherglen (migrated from JIRA) Following up on the previous comment, if the current thread (the one calling add doc) is also the one that needs to do the flushing, then only the thread attached to the doc writer with the greatest ram usage can/should do the flushing?
Michael Busch (migrated from JIRA) The easiest would be if each DocumentsWriterPerThread had a fixed buffer size, then they can flush fully independently and you don't need to manage RAM globally across threads. Of course then you'd need two config parameters: number of concurrent threads and buffer size per thread.
Michael McCandless (@mikemccand) (migrated from JIRA) But if 1 thread tends to index lots of biggish docs... don't we want to allow it to use up more than 1/nth? Ie we don't want to flush unless total RAM usage has hit the limit?
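A rough sketch of what that global policy could look like (the interface and method names are assumed for illustration, not the actual IndexWriter config): track the total RAM across all DWPTs, and when the single global limit is crossed, flush whichever DWPT is currently using the most.

```java
import java.io.IOException;
import java.util.List;

// Sketch: one global RAM limit; the DWPT with the largest buffer flushes when the
// total is exceeded, so a thread indexing big docs can use more than 1/Nth of the budget.
class FlushByTotalRamSketch {
  interface Dwpt {
    long bytesUsed();
    void flush() throws IOException;
  }

  private final long ramBufferSizeBytes;

  FlushByTotalRamSketch(long ramBufferSizeBytes) {
    this.ramBufferSizeBytes = ramBufferSizeBytes;
  }

  void maybeFlush(List<Dwpt> writers) throws IOException {
    long total = 0;
    Dwpt biggest = null;
    for (Dwpt w : writers) {
      long used = w.bytesUsed();
      total += used;
      if (biggest == null || used > biggest.bytesUsed()) {
        biggest = w;
      }
    }
    if (total >= ramBufferSizeBytes && biggest != null) {
      biggest.flush();  // only the largest DWPT flushes; the others keep indexing
    }
  }
}
```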
Michael McCandless (@mikemccand) (migrated from JIRA) OK, I opened #3971.
Jason Rutherglen (migrated from JIRA) I had to read this a few times; yes, it's very elegant, as we're skipping the postings that otherwise would be deleted immediately after flush, and we're reusing the terms map already in the DWPT.
Simon Willnauer (@s1monw) (migrated from JIRA) Can anyone gimme a quick statement about what is left here or what the status of this issue is? I am at the point where I need to do some rather big changes to DocValues which I would not need if we have DWPT, so I might rather help here before wasting time.
Michael Busch (migrated from JIRA)
This is done now (#3955) and merged into the RT branch.
I implemented and committed this approach. It's looking pretty good - almost all tests pass. Only TestStressIndexing2 is sometimes failing - but only when updateDocument() is called, not when I modify the test to only use add, delete-by-term and delete-by-query.
I haven't done this yet - it might fix the failing test I described.
Michael Busch (migrated from JIRA)
I think it's very close! The new deletes approach is implemented, and various bugs are fixed. Also the latest trunk is merged in (including LUCENE-2881). Other than TestStressIndexing2 and TestNRTThreads (updateDocument problem) and a few tests that rely on flush-by-RAM, all core and contrib tests are passing now.
Simon Willnauer (@s1monw) (migrated from JIRA)
Seems like #3647 is more isolated than the updateDocument() issue, so I think I can spend time on that one without interfering with what you are working on. I might need some time to get into what has been done so far; I might come back here or on the list if I have questions.
Michael McCandless (@mikemccand) (migrated from JIRA) One nice side effect of DWPT is that it should allow us to get over the 2GB RAM buffer limit, assuming you use multiple threads. Ie I can set my RAM buffer to 10 GB, and if I'm using 5 threads, it should work. Not sure it's really meaningful in practice, since in past tests I haven't seen any gains over a ~128 or 256 MB buffer... but maybe that's changed now.
Jason Rutherglen (migrated from JIRA) Is the max optimal DWPT size related to the size of the terms hash, or is it likely something else?
Michael McCandless (@mikemccand) (migrated from JIRA)
Bigger really should be better, I think. Because 1) the RAM efficiency ought to scale up very well, as you see a given term in more and more docs (hmm, though, maybe not, because from Zipf's law, half your terms will be singletons no matter how many docs you index), and 2) less merging is required. I'm not sure why in the past perf seemed to taper off and maybe get worse after the RAM buffer was over 256 MB... we should definitely re-test.
Jason Rutherglen (migrated from JIRA)
I'm not sure how we handled concurrency on the terms hash before, however with DWPTs there won't be contention regardless. It'd be nice if we could build 1-2 GB segments in RAM; I think that would greatly reduce the number of merges that are required downstream. Eg, then there's less need for merging by size, and most merges would be caused by the number/percentage of deletes. If it turns out the low-DF terms are causing the slowdown, maybe there is a different hashing system that could be used.
Michael McCandless (@mikemccand) (migrated from JIRA) Concurrency today is similar to DWPT in that we simply write into multiple segments in RAM. But then, on flush, we do a merge (in RAM) of these segments and write a single on-disk segment. Vs this change, which instead writes N on-disk segments and lets "normal" merging merge them. I think making a different data structure to hold low-DF terms would actually be a big boost in RAM efficiency. The RAM-per-unique-term is fairly high...
Jason Rutherglen (migrated from JIRA)
However we're not sure why a largish 1+ GB RAM buffer seems to slow things down? If we're round-robin indexing against the DWPTs, I think they'll have a similar number of unique terms as today, even though each DWPT will be smaller in total size from each containing 1/Nth of the docs.
Michael McCandless (@mikemccand) (migrated from JIRA) The slowdown could have been due to the merge sort by docID that we do today on flush. Ie, if a given term X occurs in 6 DWPTs (today) then we merge-sort the docIDs from the postings of that term, which is costly. (The "normal" merge that will merge these DWPTs after this issue lands just appends by docID.) So maybe after this lands we'll see only faster performance the larger the RAM buffer :) That would be nice!
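To illustrate the difference (a sketch only, not Lucene source): once each DWPT flushes its own segment, merging a term's postings is a plain append with a per-segment docBase shift, whereas the old in-RAM flush had to merge-sort the docIDs coming from the different thread states.

```java
// Sketch: "normal" merging of one term's postings across already-flushed segments.
// Each segment's docIDs are already sorted, so we only shift them by the segment's
// docBase and append; no k-way merge-sort by docID is needed.
class AppendingMergeSketch {
  static int[] appendMerge(int[][] perSegmentDocIDs, int[] segmentMaxDocs) {
    int total = 0;
    for (int[] seg : perSegmentDocIDs) {
      total += seg.length;
    }
    int[] merged = new int[total];
    int out = 0;
    int docBase = 0;
    for (int i = 0; i < perSegmentDocIDs.length; i++) {
      for (int docID : perSegmentDocIDs[i]) {
        merged[out++] = docBase + docID;  // shift into the merged segment's doc space
      }
      docBase += segmentMaxDocs[i];       // next segment's docIDs start after this one
    }
    return merged;
  }
}
```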
Jason Rutherglen (migrated from JIRA)
Right, this is the same principal motivation behind implementing DWPTs for use with realtime search, eg, the doc-id interleaving is too expensive to be performed at query time.
Simon Willnauer (@s1monw) (migrated from JIRA) guys I opened #4097 to land on trunk! can I close this and we iterate on #4097 from now on? simon
Simon Willnauer (@s1monw) (migrated from JIRA) we land this on trunk via #4097 |
See #3369 for motivation and more details.
I'm copying here Mike's summary he posted on 2293:
Change the approach for how we buffer in RAM to a more isolated
approach, whereby IW has N fully independent RAM segments
in-process and when a doc needs to be indexed it's added to one of
them. Each segment would also write its own doc stores and
"normal" segment merging (not the inefficient merge we now do on
flush) would merge them. This should be a good simplification in
the chain (eg maybe we can remove the *PerThread classes). The
segments can flush independently, letting us make much better
concurrent use of IO & CPU.
Migrated from LUCENE-2324 by Michael Busch, 1 vote, resolved Apr 28 2011
Attachments: ASF.LICENSE.NOT.GRANTED--lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch (versions: 4), LUCENE-2324-SMALL.patch (versions: 5), test.out (versions: 4)