Per thread DocumentsWriters that write their own private segments [LUCENE-2324] #3400

Closed · asfimport opened this issue Mar 15, 2010 · 241 comments


asfimport commented Mar 15, 2010

See #3369 for motivation and more details.

I'm copying here Mike's summary, which he posted on LUCENE-2293:

Change the approach for how we buffer in RAM to a more isolated
approach, whereby IW has N fully independent RAM segments
in-process and when a doc needs to be indexed it's added to one of
them. Each segment would also write its own doc stores and
"normal" segment merging (not the inefficient merge we now do on
flush) would merge them. This should be a good simplification in
the chain (eg maybe we can remove the *PerThread classes). The
segments can flush independently, letting us make much better
concurrent use of IO & CPU.
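To make the shape of this concrete, here is a minimal conceptual sketch (hypothetical names, not actual Lucene classes) of what "N fully independent RAM segments" with per-thread binding could look like:

```java
import java.util.concurrent.ConcurrentHashMap;

// Conceptual sketch only: each indexing thread is bound to its own private in-RAM
// segment, so threads never contend on the in-memory indexing structures, and each
// segment can flush on its own instead of being merged at flush time.
class PerThreadRamSegments {
  static class RamSegment {
    long bytesUsed;
    void addDocument(String docText) { /* buffer postings, doc stores, norms... */ }
    void flush()                     { /* write this segment's own files to disk */ }
  }

  private final ConcurrentHashMap<Thread, RamSegment> byThread = new ConcurrentHashMap<>();

  void addDocument(String docText) {
    // Bind the calling thread to a private segment (created lazily).
    RamSegment seg = byThread.computeIfAbsent(Thread.currentThread(), t -> new RamSegment());
    seg.addDocument(docText);
  }
}
```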


Migrated from LUCENE-2324 by Michael Busch, 1 vote, resolved Apr 28 2011
Attachments: ASF.LICENSE.NOT.GRANTED--lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch (versions: 4), LUCENE-2324-SMALL.patch (versions: 5), test.out (versions: 4)


Michael Busch (migrated from JIRA)

Here is an interesting article about allocation/deallocation on modern JVMs:
http://www.ibm.com/developerworks/java/library/j-jtp09275.html

And here is a snippet that mentions how pooling is generally not faster anymore:


Allocation in JVMs was not always so fast – early JVMs indeed had poor allocation and garbage collection performance, which is almost certainly where this myth got started. In the very early days, we saw a lot of "allocation is slow" advice – because it was, along with everything else in early JVMs – and performance gurus advocated various tricks to avoid allocation, such as object pooling. (Public service announcement: Object pooling is now a serious performance loss for all but the most heavyweight of objects, and even then it is tricky to get right without introducing concurrency bottlenecks.) However, a lot has happened since the JDK 1.0 days; the introduction of generational collectors in JDK 1.2 has enabled a much simpler approach to allocation, greatly improving performance.



Michael McCandless (@mikemccand) (migrated from JIRA)

Sounds great – let's test it in practice.


asfimport commented Mar 15, 2010

Michael Busch (migrated from JIRA)

Reply to Mike's comment on #3369: https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12845263&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12845263

> I think we can do even better, ie, that class wastes RAM for the single posting case (intStart, byteStart, lastDocID, docFreq, lastDocCode, lastDocPosition are not needed).

> EG we could have a separate class dedicated to the singleton case. When term is first encountered it's enrolled there. We'd probably need a separate hash to store these (though not necessarily?). If it's seen again it's switched to the full posting.

Hmm I think we'd need a separate hash. Otherwise you have to subclass PostingList for the different cases (freq. vs. non-frequent terms) and do instanceof checks? Or with the parallel arrays idea maybe we could encode more information in the dense ID? E.g. use one bit to indicate if that term occurred more than once.

> I mean instead of allocating an instance per unique term, we assign an integer ID (dense, ie, 0, 1, 2...).

> And then we have an array for each member now in FreqProxTermsWriter.PostingList, ie int[] docFreqs, int[] lastDocIDs, etc. Then to look up say the lastDocID for a given postingID you just get lastDocIDs[postingID]. If we're worried about oversize allocation overhead, we can make these arrays paged... but that'd slow down each access.

Yeah I like that idea. I've done something similar for representing trees - I had a very compact Node class with no data but such a dense ID, and arrays that stored the associated data. Very easy to add another data type with no RAM overhead (you only use the amount of RAM the data needs).

Though, the price you pay is dereferencing into multiple arrays for each lookup?
And how much RAM would we save? The pointer to the PostingList object (4-8 bytes), plus the size of the object header - how much is that in Java?

Seems like it's 8 bytes: http://www.codeinstructions.com/2008/12/java-objects-memory-structure.html

So in a 32-bit JVM we would save 4 bytes (pointer) + 8 bytes (header) - 4 bytes (ID) = 8 bytes per term. For fields with tons of unique terms that might be worth it?
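For illustration, here is a rough sketch (hypothetical code, not the actual FreqProxTermsWriter internals) of the parallel-arrays idea: each unique term gets a dense int ID that indexes into primitive arrays, instead of its own PostingList object:

```java
import java.util.Arrays;

// Sketch: per-term state lives in parallel primitive arrays indexed by a dense term ID,
// so there is no per-term object (no header, no pointer) and far fewer objects for GC.
class ParallelPostingsArrays {
  int[] docFreqs     = new int[16];
  int[] lastDocIDs   = new int[16];
  int[] lastDocCodes = new int[16];
  private int size;

  // Assign the next dense ID (0, 1, 2, ...) to a newly seen term.
  int newTerm() {
    if (size == docFreqs.length) {
      grow(size * 2);
    }
    return size++;
  }

  private void grow(int newLength) {
    docFreqs     = Arrays.copyOf(docFreqs, newLength);
    lastDocIDs   = Arrays.copyOf(lastDocIDs, newLength);
    lastDocCodes = Arrays.copyOf(lastDocCodes, newLength);
  }
}
```

Looking up, say, the last docID for a term is then just lastDocIDs[termID], and growing means resizing a handful of primitive arrays rather than allocating new per-term objects.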


Michael Busch (migrated from JIRA)

> Sounds great - let's test it in practice.

I have to admit that I need to catch up a bit on the flex branch. I was wondering if it makes sense to make these kinds of experiments (pooling vs. non-pooling) with the flex code? Is it as fast as trunk already, or are there related nocommits left that affect indexing performance? I would think not much of the flex changes should affect the in-memory indexing performance (in TermsHash*).


Earwin Burrfoot (migrated from JIRA)

> Seems like it's 8 bytes
Object header is two words, so that's 16 bytes on a 64-bit arch. (probably 12 for 64-bit + CompressedOops?)

Also, GC time is (roughly) linear in the number of objects on the heap, so replacing a single huge array of objects with a few huge primitive arrays for their fields does wonders for your GC delays.


Michael McCandless (@mikemccand) (migrated from JIRA)

> Hmm I think we'd need a separate hash. Otherwise you have to subclass PostingList for the different cases (freq. vs. non-frequent terms) and do instanceof checks? Or with the parallel arrays idea maybe we could encode more information in the dense ID? E.g. use one bit to indicate if that term occurred more than once.

Or 2 sets of parallel arrays (one for the singletons).... or, something.

> So in a 32-bit JVM we would save 4 bytes (pointer) + 8 bytes (header) - 4 bytes (ID) = 8 bytes per term. For fields with tons of unique terms that might be worth it?

And also the GC cost.

But it seems like specializing singleton fields will be the bigger win.
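As a rough illustration of that specialization (hypothetical code, not what trunk does), a term seen only once could live in a slim map and be promoted to the full parallel arrays only on its second occurrence:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch: singleton terms carry only the docID they were seen in; a term is promoted
// to the full per-term arrays the second time it is encountered.
class SingletonAwarePostings {
  private int[] freqs      = new int[16];   // simplified occurrence counts
  private int[] lastDocIDs = new int[16];
  private int promotedCount;

  private final Map<String, Integer> singletonDocID = new HashMap<>();
  private final Map<String, Integer> promotedID = new HashMap<>();

  void addOccurrence(String term, int docID) {
    Integer id = promotedID.get(term);
    if (id != null) {                        // already a "full" term
      freqs[id]++;
      lastDocIDs[id] = docID;
      return;
    }
    Integer firstDocID = singletonDocID.remove(term);
    if (firstDocID == null) {                // first occurrence: stay in the slim map
      singletonDocID.put(term, docID);
      return;
    }
    // Second occurrence: promote to the parallel arrays.
    if (promotedCount == freqs.length) {
      freqs = Arrays.copyOf(freqs, promotedCount * 2);
      lastDocIDs = Arrays.copyOf(lastDocIDs, promotedCount * 2);
    }
    int newID = promotedCount++;
    promotedID.put(term, newID);
    freqs[newID] = 2;
    lastDocIDs[newID] = docID;
  }
}
```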

> I was wondering if it makes sense to make these kinds of experiments (pooling vs. non-pooling) with the flex code?

Last I tested (a while back now) indexing perf was the same – need to
test again w/ recent changes (eg terms index is switching to packed
ints). For pooling vs not I'd just do the experiment on trunk?

And most of this change (changing how postings data is buffered in
RAM) is "above" flex I expect.

But if for some reason you need to start changing index postings
format then you should probably do that on flex.


Michael McCandless (@mikemccand) (migrated from JIRA)

> Seems like it's 8 bytes

> Object header is two words, so that's 16 bytes on a 64-bit arch. (probably 12 for 64-bit + CompressedOops?)

Right, and the pointer'd also be 8 bytes (but compact int stays at 4
bytes) so net/net on 64bit JRE savings would be 16-20 bytes per term.

Another thing we could do if we cut over to parallel arrays is to
switch to packed ints. Many of these fields are horribly wasteful as
ints, eg docFreq or lastPosition.
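As a sketch of the packed-ints idea (illustrative only, not Lucene's actual PackedInts API), values that never need more than, say, 20 bits can be stored back to back in long blocks instead of one 32-bit int each, cutting that array's footprint by more than a third:

```java
// Write-once bit-packing sketch: valueCount values of bitsPerValue bits (< 64) each,
// stored contiguously in long[] blocks.
class PackedIntArray {
  private final long[] blocks;
  private final int bitsPerValue;

  PackedIntArray(int valueCount, int bitsPerValue) {    // assumes 0 < bitsPerValue < 64
    this.bitsPerValue = bitsPerValue;
    this.blocks = new long[(int) (((long) valueCount * bitsPerValue + 63) / 64)];
  }

  // Set a previously unset slot (write-once); value must fit in bitsPerValue bits.
  void set(int index, long value) {
    long bitPos = (long) index * bitsPerValue;
    int block = (int) (bitPos >>> 6);
    int shift = (int) (bitPos & 63);
    blocks[block] |= value << shift;
    if (shift + bitsPerValue > 64) {                     // value straddles two blocks
      blocks[block + 1] |= value >>> (64 - shift);
    }
  }

  long get(int index) {
    long bitPos = (long) index * bitsPerValue;
    int block = (int) (bitPos >>> 6);
    int shift = (int) (bitPos & 63);
    long value = blocks[block] >>> shift;
    if (shift + bitsPerValue > 64) {
      value |= blocks[block + 1] << (64 - shift);
    }
    return value & ((1L << bitsPerValue) - 1);
  }
}
```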


asfimport commented Mar 16, 2010

Jason Rutherglen (migrated from JIRA)

Carrying over from #3388: I'm proposing that, for starters, we have a byte slice writer - lock, move or copy(?) the bytes from the writable byte pool/writer to a read-only byte block pool, unlock. This sounds like a fairly self-contained thing that can be unit tested at a low level.

Mike, can you add a bit as to how this could work? Also, what is the IntBlockPool used for?


Jason Rutherglen (migrated from JIRA)

Are there going to be issues with the char array buffers as well (ie, will we need to also flush them for concurrency?)


asfimport commented Mar 16, 2010

Michael Busch (migrated from JIRA)

Shall we not first try to remove the downstream *PerThread classes and make the DocumentsWriter single-threaded, without locking? Then we add a PerThreadDocumentsWriter and a DocumentsWriterThreadBinder, which talks to the PerThreadDWs, and IW talks to the DWTB. We can pick other names :)

When that's done we can think about what kind of locking/synchronization/volatile stuff we need for #3388.


asfimport commented Mar 16, 2010

Jason Rutherglen (migrated from JIRA)

Michael,

For #3388, I think the searching isn't going to be an
issue; I've got basic per-thread doc writers working (though not
thoroughly tested). I didn't see a great need to rework all the
classes, and even if we did, I'm not sure it helps with the byte
array read/write issues? I'd prefer to get a proof of concept
more or less working, then refine it from there. I think there are
two main design/implementation issues before we can roll
something out:

  1. A new skip list implementation that writes a new skip (ie, a
    single level) at specific intervals (see the sketch after this
    list). Right now in trunk we have a multi-level skip list that
    requires the number of docs ahead of time.

  2. Figure out the low -> high levels of byte/char/int array
    visibility to reader threads. The main challenge here is the
    fact that the DW-related code that utilizes this is really hard
    for me to understand well enough to know what can be changed
    without the side effect being bunches of other broken stuff. If
    there was a Directory-like class abstraction we could simply
    override and reimplement, we could do that; maybe there is one,
    I'm not sure yet.
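A bare-bones sketch of item 1 (hypothetical, not trunk's MultiLevelSkipListWriter): a single-level skip list simply records one skip entry every skipInterval docs, so it never needs to know the total doc count up front:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: buffer one (docID, file pointer) skip entry every skipInterval docs.
class SingleLevelSkipWriter {
  private final int skipInterval;
  private final List<Integer> skipDocIDs = new ArrayList<>();
  private final List<Long> skipPointers = new ArrayList<>();
  private int docCount;

  SingleLevelSkipWriter(int skipInterval) {
    this.skipInterval = skipInterval;
  }

  // Called once per doc added to a term's postings; fp is the postings position
  // a reader could seek to in order to skip ahead.
  void bufferSkip(int docID, long fp) {
    if (++docCount % skipInterval == 0) {
      skipDocIDs.add(docID);
      skipPointers.add(fp);
    }
  }
}
```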

However, if reworking the PerThread classes somehow makes the tie-in
to the IO (eg, the byte array pooling) system more abstracted and
easier to work with, then I'm all for it.


Jason Rutherglen (migrated from JIRA)

NormsWriterPerField has a growing norm byte array; we'd need a way to read/write lock it...

I think we have concurrency issues in the TermsHash table? Maybe it'd need to be rewritten to use ConcurrentHashMap?


Jason Rutherglen (migrated from JIRA)

Actually TermsHashField doesn't need to be concurrent: it's only being written to, and the terms concurrent skip list (was a btree) holds the reference to the posting list. So I think we're good there, because the terms enum never accesses the terms hash. Nice!


asfimport commented Mar 16, 2010

Michael Busch (migrated from JIRA)

I think we all agree that we want a single-writer, multi-reader thread model. Only then can the thread-safety problems in #3388 be reduced to visibility (no write-locking). So I think making this change first makes the most sense. It involves a bit of boring refactoring work, unfortunately.
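A tiny sketch of what "reduced to visibility" can mean in practice (illustrative only, not Lucene code): with a single writer per in-RAM segment, readers only need a safe publication point, eg a volatile counter, rather than write locks:

```java
// Single-writer / multi-reader sketch: the writer fills the array, then publishes the
// new count with a volatile write; readers that see the count also see the data
// (happens-before), so no locking is required. Fixed capacity for brevity.
class SingleWriterIntBuffer {
  private final int[] values = new int[1 << 16];
  private volatile int count;                 // publication point

  void add(int value) {                       // writer thread only
    values[count] = value;
    count = count + 1;                        // volatile write publishes the element
  }

  int get(int index) {                        // any reader thread
    if (index >= count) {
      throw new IndexOutOfBoundsException("not yet published: " + index);
    }
    return values[index];
  }

  int size() {
    return count;
  }
}
```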


Jason Rutherglen (migrated from JIRA)

Michael, Agreed, can you outline how you think we should proceed then?


Michael Busch (migrated from JIRA)

> Michael, Agreed, can you outline how you think we should proceed then?

Sorry for not responding earlier...

I'm currently working on removing the PostingList object pooling, because that makes TermsHash and TermsHashPerThread much simpler. I have written the patch and all tests pass, though I haven't done performance testing yet. Making TermsHash and TermsHashPerThread smaller will also make the patch here (which will remove them) easier. I'll post the patch soon.

Next steps here, I think, are to make everything downstream of DocumentsWriter single-threaded (removal of the *PerThread classes). Then we need to write the DocumentsWriterThreadBinder and think about how to apply deletes, commits and rollbacks to all DocumentsWriter instances.


Michael Busch (migrated from JIRA)

All tests pass, but I have to review whether the memory consumption calculation still works correctly with the changes. Not sure if the JUnit tests cover that?

Also haven't done any performance testing yet.


asfimport commented Mar 25, 2010

Jason Rutherglen (migrated from JIRA)

Michael, I'm guessing this patch needs to be updated as per #3405?


Jason Rutherglen (migrated from JIRA)

Actually, I just browsed the patch again; I don't think it implements private doc writers yet?

I think you're right, we can get this issue completed. LUCENE-2312's path looks clear at this point. Shall I take a whack at it?


asfimport commented Mar 25, 2010

Michael Busch (migrated from JIRA)

Hey Jason,

Disregard my patch here. I just experimented with removal of pooling, but then did #3405 instead. TermsHash and TermsHashPerThread are now much simpler, because all the pooling code is gone after 2329 was committed. Should make it a little easier to get this patch done.

Sure, it'd be awesome if you could provide a patch here. I can help you; we should just post patches here frequently so that we don't both work on the same areas.


Jason Rutherglen (migrated from JIRA)

Michael, I'm working on a patch and will post one (hopefully) shortly.


Michael Busch (migrated from JIRA)

Awesome!


Jason Rutherglen (migrated from JIRA)

I'm a little confused by the flushedDocCount / remap-deletes portions of DocWriter. flushedDocCount is used as a global counter; however, when we move to per-thread doc writers, it won't be global anymore. Is there a different (easier) way to remap deletes?


Michael McCandless (@mikemccand) (migrated from JIRA)

Good question...

When we buffer a delete Term/Query we record the current docID as of when that delete arrived (so that interleaved deletes/adds are resolved properly). The docID we record is "absolute" (ie, it adds in the base flushedDocCount), so that we can decouple when deletes are materialized (moved into the deletedDocs BitVectors) from when new segments are flushed.

I think we have a couple options.

Option 1 is to use a relative (within the current segment) docID when the deleted Term/Query/docID is first buffered, but then make it absolute only when the segment is finally flushed.

Option 2 is to use a relative docID, but do away with the decoupling, ie force deletions to always flush at the same time the segment is flushed.

I think I like option 1 the best – I suspect the decoupling gains us performance as it allows us to batch up more deletions (doing deletions in batch gets better locality, and also means opening/closing readers less often, in the non-pooling case).
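A minimal sketch of option 1, with hypothetical names: buffer deletes against segment-relative docIDs, and add the docBase (the sum of the doc counts of all previously flushed segments) only when the segment is flushed:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: deletes are recorded against docIDs relative to the in-RAM segment; they are
// made absolute only at flush time, so no global flushedDocCount is needed while buffering.
class PerSegmentBufferedDeletes {
  // delete term -> highest segment-relative docID the delete applies to
  private final Map<String, Integer> deleteTerms = new HashMap<>();

  void bufferDeleteTerm(String term, int relativeDocIDUpto) {
    deleteTerms.put(term, relativeDocIDUpto);
  }

  // docBase = sum of the doc counts of all segments flushed before this one.
  Map<String, Integer> remapToAbsolute(int docBase) {
    Map<String, Integer> absolute = new HashMap<>();
    for (Map.Entry<String, Integer> e : deleteTerms.entrySet()) {
      absolute.put(e.getKey(), e.getValue() + docBase);
    }
    return absolute;
  }
}
```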


Jason Rutherglen (migrated from JIRA)

Mike, let's do option 1. I think the process of making the docID absolute is simply adding up the previous segments' doc counts to form the base?

Option 2 would use reader cloning?


Michael McCandless (@mikemccand) (migrated from JIRA)

> I think the process of making the docID absolute is simply adding up the previous segments' doc counts to form the base?

Right.

> Option 2 would use reader cloning?

I don't think so – I think it'd have to pull a SegmentReader for every segment every time we flush a new segment, to resolve the deletions. In the non-pooled case that'd be a newly opened SegmentReader for every segment in the index every time a new segment is flushed.


Jason Rutherglen (migrated from JIRA)

Currently the doc writer manages the RAM buffer size; however,
this needs to be implemented across doc writers for this issue
to be complete. IW's addDoc returns doFlush from DW. I don't think
doFlush will be useful anymore?

A slightly different memory management scheme needs to be
designed. Right now we allow the user to set the max RAM buffer
size, and when the doc writer's buffers exceed the RAM limit, the
buffer is flushed and the process is complete.

With this issue, the flush logic probably needs to be bumped up
into IW, and flushing becomes a multi-doc-writer RAM usage
examination. For starters, if the aggregate RAM usage of all doc
writers exceeds the IWC-defined RAM buffer size, we need to
schedule flushing the doc writer with the greatest RAM usage? I
wonder if there's something I'm missing here with regard to
synchronization issues with DW?
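One way this could look, as a hedged sketch (the Dwpt interface here is a stand-in, not the real class): sum the RAM used by all doc writers and, once the aggregate crosses the configured limit, pick the largest one to flush:

```java
import java.util.List;

// Sketch of the flush-by-RAM idea described above: flush nothing until the aggregate
// RAM use of all per-thread doc writers exceeds the limit, then flush the biggest one.
class FlushByRamPolicy {
  interface Dwpt {                       // stand-in for a per-thread doc writer
    long bytesUsed();
  }

  static Dwpt pickWriterToFlush(List<? extends Dwpt> writers, long ramBufferSizeBytes) {
    long total = 0;
    Dwpt largest = null;
    for (Dwpt w : writers) {
      long used = w.bytesUsed();
      total += used;
      if (largest == null || used > largest.bytesUsed()) {
        largest = w;
      }
    }
    return total > ramBufferSizeBytes ? largest : null;   // null = no flush needed yet
  }
}
```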


Jason Rutherglen (migrated from JIRA)

Following up on the previous comment, if the current thread (the one calling add doc) is also the one that needs to do the flushing, then only the thread attached to the doc writer with the greatest ram usage can/should do the flushing?


Michael Busch (migrated from JIRA)

The easiest would be if each DocumentsWriterPerThread had a fixed buffer size, then they can flush fully independently and you don't need to manage RAM globally across threads.

Of course then you'd need two config parameters: number of concurrent threads and buffer size per thread.


Michael McCandless (@mikemccand) (migrated from JIRA)

But if 1 thread tends to index lots of biggish docs... don't we want to allow it to use up more than 1/nth?

Ie we don't want to flush unless total RAM usage has hit the limit?


asfimport commented Jan 29, 2011

Michael McCandless (@mikemccand) (migrated from JIRA)

OK, I opened #3971.


Jason Rutherglen (migrated from JIRA)

I had to read this a few times; yes, it's very elegant, as we're skipping the postings that would otherwise be deleted immediately after flush, and we're reusing the terms map already in the DWPT.


Simon Willnauer (@s1monw) (migrated from JIRA)

Can anyone gimme a quick statement about what is left here or what the status of this issue is? I am at the point where I need to do some rather big changes to DocValues which I would not need if we had DWPT, so I might rather help here before wasting time.


asfimport commented Feb 25, 2011

Michael Busch (migrated from JIRA)

> Somehow, we have to let each DWPT have some privacy, but, the field name -> number binding should be "global". I think Simon is going to open a separate issue to make something possible along these lines...

This is done now (#3955) and merged into the RT branch.

The current plan w/ deletes is that a delete gets buffered 1) into the global pool (stored in DW and pushed whenever any DWPT flushes), as well as 2) per DWPT. The per-DWPT pools apply only to the segment flushed from that DWPT, while the global pool applies during coalescing (ie to all "prior" segments).

I implemented and committed this approach. It's looking pretty good - almost all tests pass. Only TestStressIndexing2 is sometimes failing - but only when updateDocument() is called, not when I modify the test to only use add, delete-by-term and delete-by-query.
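Roughly, and with made-up names, the buffering described above looks like this: every delete goes into the global pool (applied to already-flushed segments whenever any DWPT flushes) and into each live DWPT's private pool (applied only to that DWPT's own segment when it flushes):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch only: one global delete pool plus one private pool per DWPT.
class DeleteBuffering {
  static class DeletePool {
    final Set<String> terms = new HashSet<>();
  }

  final DeletePool globalPool = new DeletePool();          // coalesced against prior segments
  final List<DeletePool> perDwptPools = new ArrayList<>(); // one per live DWPT

  void bufferDeleteTerm(String term) {
    globalPool.terms.add(term);              // pushed whenever any DWPT flushes
    for (DeletePool pool : perDwptPools) {
      pool.terms.add(term);                  // applies only to that DWPT's own segment
    }
  }
}
```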

To avoid the full-stop, I think during the flush we can have two global delete pools. We carefully sweep all DWPTs and flush each, in succession. Any DWPT not yet flushed is free to continue indexing as normal, putting deletes into the first global pool, flushing as normal. But, a DWPT that has been flushed by the "sweeper" must instead put deletes for an updateDocument carefully into the 2nd pool, and not buffer the delete into DWPTs not yet flushed.

I haven't done this yet - it might fix the failing test I described.


asfimport commented Feb 25, 2011

Michael Busch (migrated from JIRA)

> Can anyone gimme a quick statement about what is left here or what the status of this issue is? I am at the point where I need to do some rather big changes to DocValues which I would not need if we had DWPT, so I might rather help here before wasting time.

I think it's very close! The new deletes approach is implemented, and various bugs are fixed. Also the latest trunk is merged in (including LUCENE-2881).
Outstanding issues are to fix the updateDocument() problems, and finish flush-by-RAM (#3647).

Other than TestStressIndexing2 and TestNRTThreads (updateDocument problem) and a few tests that rely on flush-by-RAM, all core and contrib tests are passing now.


asfimport commented Feb 25, 2011

Simon Willnauer (@s1monw) (migrated from JIRA)

> Outstanding issues are to fix the updateDocument() problems, and finish flush-by-RAM (#3647).

Seems like #3647 is more isolated than the updateDocument() issue, so I think I can spend time on that one without interfering with what you are working on. I might need some time to get into what has been done so far; I might come back here or on the list if I have questions.


Michael McCandless (@mikemccand) (migrated from JIRA)

One nice side effect of DWPT is it should allow us to get over the 2GB RAM buffer limit, assuming you use multiple threads.

Ie I can set my RAM buffer to 10 GB, and if I'm using 5 threads, it should work.

Not sure it's really meaningful in practice, since in past tests I haven't seen any gains over ~128 or 256 MB buffer... but maybe that's changed now.


Jason Rutherglen (migrated from JIRA)

Is the max optimal DWPT size related to the size of the terms hash, or is it likely something else?


Michael McCandless (@mikemccand) (migrated from JIRA)

> Is the max optimal DWPT size related to the size of the terms hash, or is it likely something else?

Bigger really should be better I think.

Because 1) the RAM efficiency ought to scale up very well, as you see a given term in more and more docs (hmm, though, maybe not, because from Zipf's law, half your terms will be singletons no matter how many docs you index), and 2) less merging is required.

I'm not sure why in the past perf seemed to taper off and maybe get worse after the RAM buffer was over 256 MB... we should definitely re-test.


Jason Rutherglen (migrated from JIRA)

> Because 1) the RAM efficiency ought to scale up very well, as you see a given term in more and more docs (hmm, though, maybe not, because from Zipf's law, half your terms will be singletons no matter how many docs you index), and 2) less merging is required.

I'm not sure how we handled concurrency on the terms hash before; however, with DWPTs there won't be contention regardless. It'd be nice if we could build 1-2 GB segments in RAM; I think that would greatly reduce the number of merges that are required downstream. Eg, then there's less need for merging by size, and most merges would be caused by the number/percentage of deletes. If it turns out the low-DF terms are causing the slowdown, maybe there is a different hashing system that could be used.


Michael McCandless (@mikemccand) (migrated from JIRA)

Concurrency today is similar to DWPT in that we simply write into multiple segments in RAM.

But then, on flush, we do a merge (in RAM) of these segments and write a single on-disk segment.

Vs this change which instead writes N on-disk segments and lets "normal" merging merge them.

I think making a different data structure to hold low-DF terms would actually be a big boost in RAM efficiency. The RAM-per-unique-term is fairly high...


Jason Rutherglen (migrated from JIRA)

> I think making a different data structure to hold low-DF terms would actually be a big boost in RAM efficiency. The RAM-per-unique-term is fairly high...

However, we're not sure why a largish 1+ GB RAM buffer seems to slow down? If we're round-robin indexing against the DWPTs, I think they'll have a similar number of unique terms as today, even though each DWPT will be smaller in total size from each containing 1/Nth of the docs.


Michael McCandless (@mikemccand) (migrated from JIRA)

The slowdown could have been due to the merge sort by docID that we do today on flush.

Ie, if a given term X occurs in 6 DWPTs (today) then we merge-sort the docIDs from the postings of that term, which is costly. (The "normal" merge that will merge these DWPTs after this issue lands just appends by docID.)

So maybe after this lands we'll see only faster performance the larger the RAM buffer :) That would be nice!
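A toy illustration of why that is cheaper (illustrative code, not the real merger): appending per-segment postings only needs a running docBase added to already-sorted docIDs, instead of a docID merge-sort across segments:

```java
// Sketch: concatenate each segment's (already sorted) docIDs for one term, shifting
// them by the cumulative doc count of the segments that came before.
class AppendingPostingsMerge {
  static int[] appendPostings(int[][] perSegmentDocIDs, int[] segmentMaxDocs) {
    int total = 0;
    for (int[] seg : perSegmentDocIDs) {
      total += seg.length;
    }
    int[] merged = new int[total];
    int out = 0;
    int docBase = 0;
    for (int i = 0; i < perSegmentDocIDs.length; i++) {
      for (int docID : perSegmentDocIDs[i]) {
        merged[out++] = docBase + docID;     // no sorting needed within or across segments
      }
      docBase += segmentMaxDocs[i];          // next segment's docIDs start after this one
    }
    return merged;
  }
}
```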


Jason Rutherglen (migrated from JIRA)

> Ie, if a given term X occurs in 6 DWPTs (today) then we merge-sort the docIDs from the postings of that term, which is costly. (The "normal" merge that will merge these DWPTs after this issue lands just appends by docID.)

Right, this is the same principal motivation behind implementing DWPTs for use with realtime search, eg, the doc-id interleaving is too expensive to be performed at query time.


asfimport commented Apr 14, 2011

Simon Willnauer (@s1monw) (migrated from JIRA)

guys I opened #4097 to land on trunk! can I close this and we iterate on #4097 from now on?

simon


asfimport commented Apr 28, 2011

Simon Willnauer (@s1monw) (migrated from JIRA)

We landed this on trunk via #4097.
