Improve indexing performance by increasing internal buffer sizes [LUCENE-888] #1963
Comments
Michael Busch (migrated from JIRA)

> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if

Cool!

> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE

I submitted a patch for #1508 which avoids copying the buffer when

> The CompoundFileWriter buffer is created only briefly, so I think we

I'm wondering how much performance benefits if you increase the buffer
Michael McCandless (@mikemccand) (migrated from JIRA)

> > I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE

I like that idea! I am actually seeing that increased buffer sizes for

I wonder if we should just add a ctor to BufferedIndexInput that takes

Maybe we do the setBufferSize approach, but, if the buffer already

> > The CompoundFileWriter buffer is created only briefly, so I think we

I'm testing now different sizes of each of these three buffers
Michael Busch (migrated from JIRA)

> I wonder if we should just add a ctor to BufferedIndexInput that takes

Yeah, I was thinking about the ctor approach as well. Actually

    public IndexInput openInput(String name, int bufferSize)

This should solve the problems you mentioned like in SegmentTermEnum

After a clone however, we would still have to cast to
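The openInput(String name, int bufferSize) overload proposed above could look roughly like this. This is only a sketch, not the real Lucene Directory API: SketchDirectory, its in-memory file map, and SketchIndexInput are inventions for illustration. The point is just that existing callers keep the old 1 KB default while callers that know their access pattern can pick a size per stream.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of a per-stream buffer-size overload
 * (simplified names; not the actual Lucene Directory class).
 */
public class SketchDirectory {
    public static final int DEFAULT_BUFFER_SIZE = 1024;  // current 1 KB default
    private final Map<String, byte[]> files = new HashMap<>();

    public void writeFile(String name, byte[] contents) {
        files.put(name, contents);
    }

    /** Existing signature: unchanged behavior for current callers. */
    public SketchIndexInput openInput(String name) {
        return openInput(name, DEFAULT_BUFFER_SIZE);
    }

    /** Proposed overload: caller chooses the buffer size at open time. */
    public SketchIndexInput openInput(String name, int bufferSize) {
        return new SketchIndexInput(files.get(name), bufferSize);
    }

    /** Stand-in for IndexInput; only tracks the size it was opened with. */
    public static class SketchIndexInput {
        private final byte[] data;
        private final int bufferSize;

        SketchIndexInput(byte[] data, int bufferSize) {
            this.data = data;
            this.bufferSize = bufferSize;
        }

        public int getBufferSize() { return bufferSize; }
    }
}
```

A merge thread could then open the freq stream with, say, a 16 KB buffer while search-time readers keep the default.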
Michael McCandless (@mikemccand) (migrated from JIRA)

> > I wonder if we should just add a ctor to BufferedIndexInput that takes

Actually, it does have a default public constructor, right? Ie if we add

    public BufferedIndexInput()

then I think we don't break API backwards compatibility?

> After a clone however, we would still have to cast to

I plan to add "private int bufferSize" to BufferedIndexInput,
Michael Busch (migrated from JIRA)

> Actually, it does have a default public constructor, right? Ie if we add
> public BufferedIndexInput()
> then I think we don't break API backwards compatibility?

Oops! Of course, you are right. What was I thinking...

> I plan to add "private int bufferSize" to BufferedIndexInput,

True. But it would be nice if it was possible to change the buffer size

Hmm, in SegmentTermDocs the freq stream is cloned in the ctor. If the

> Maybe we do the setBufferSize approach, but, if the buffer already

So yes, I think we should implement it this way.
Michael McCandless (@mikemccand) (migrated from JIRA)

> > I plan to add "private int bufferSize" to BufferedIndexInput,

OK, I agree: let's add a BufferedIndexInput.setBufferSize() and also

> Hmm, in SegmentTermDocs the freq stream is cloned in the ctor. If the

OK I will do this. Actually, I think we should also allow making the
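A minimal sketch of the setBufferSize() idea under discussion (not Lucene's actual implementation; the class below and its in-memory "file" are hypothetical stand-ins): resizing carries over any bytes that were already buffered but not yet consumed, so a caller such as a cloned term stream can grow its buffer mid-read without losing data.

```java
/**
 * Hypothetical buffered reader with a resizable buffer, illustrating
 * the setBufferSize() approach. A byte[] stands in for the file.
 */
public class ResizableBufferedInput {
    private final byte[] data;   // stands in for the underlying file
    private int filePointer;     // next file byte to pull into the buffer
    private byte[] buffer;
    private int bufferLength;    // valid bytes currently in the buffer
    private int bufferPosition;  // next buffered byte to hand out

    public ResizableBufferedInput(byte[] data, int bufferSize) {
        this.data = data;
        this.buffer = new byte[bufferSize];
    }

    public int getBufferSize() { return buffer.length; }

    /** Resize the buffer, preserving buffered-but-unread bytes. */
    public void setBufferSize(int newSize) {
        if (newSize == buffer.length) return;     // already that size
        int unread = bufferLength - bufferPosition;
        if (newSize < unread) {
            // Simplification for this sketch: never shrink below the
            // pending byte count (real code could refill instead).
            newSize = unread;
        }
        byte[] newBuffer = new byte[newSize];
        System.arraycopy(buffer, bufferPosition, newBuffer, 0, unread);
        buffer = newBuffer;
        bufferLength = unread;
        bufferPosition = 0;
    }

    public byte readByte() {
        if (bufferPosition >= bufferLength) refill();
        return buffer[bufferPosition++];
    }

    private void refill() {
        int n = Math.min(buffer.length, data.length - filePointer);
        if (n <= 0) throw new IllegalStateException("read past EOF");
        System.arraycopy(data, filePointer, buffer, 0, n);
        filePointer += n;
        bufferLength = n;
        bufferPosition = 0;
    }

    public static void main(String[] args) {
        byte[] file = new byte[100];
        for (int i = 0; i < file.length; i++) file[i] = (byte) i;
        ResizableBufferedInput in = new ResizableBufferedInput(file, 4);
        for (int i = 0; i < 3; i++) in.readByte();  // consume bytes 0..2
        in.setBufferSize(16);                       // grow mid-stream
        for (int i = 3; i < 100; i++) {
            if (in.readByte() != (byte) i) throw new AssertionError("at " + i);
        }
        System.out.println("ok");
    }
}
```

Only reallocating when the size actually changes (the early return) matches the "if the buffer already..." concern above: repeated setBufferSize calls with the same value cost nothing.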
Michael Busch (migrated from JIRA)

> OK I will do this. Actually, I think we should also allow making the

Yep, sounds good. I can code this and commit it with #1508.
Michael McCandless (@mikemccand) (migrated from JIRA)

> > OK I will do this. Actually, I think we should also allow making the

I'm actually coding it up now. Why don't you commit #1508
Marvin Humphrey (migrated from JIRA)

I would like to know why these gains are appearing, and how specific they are to a particular system. How can the optimum buffer size be deduced? Is it a function of hard disk sector size? Memory page size? Lucene's write behavior pattern? Level X cache size?
Michael Busch (migrated from JIRA)

> I'm actually coding it up now. Why don't you commit #1508

Done.
Michael McCandless (@mikemccand) (migrated from JIRA)

OK I ran two sets of tests. First is only on Mac OS X to see how

The performance gains are 10-18% faster overall.

FIRST TEST

I increased buffer sizes, separately, for each of:

- BufferedIndexInput
- CompoundFileWriter
- BufferedIndexOutput

Comments:

Given this I picked 16 K buffer for BufferedIndexOutput, 16 K buffer

Then, I re-tested the baseline (trunk) & these buffer sizes across

SECOND TEST

Baseline (trunk) = 1 K buffers for all 3. New = 16 K for

I ran each test 4 times & took the best time:

- Quad core Mac OS X on 4-drive RAID 0
- Dual core Debian Linux (2.6.18 kernel) on 6 drive RAID 5
- Windows XP Pro laptop, single drive

Net/net it's between 10-18% performance gain overall. It is
Michael McCandless (@mikemccand) (migrated from JIRA)

> I would like to know why these gains are appearing, and how specific

It looks like the gains are cross platform (at least between OS X,

I'm not sure how this depends/correlates to the various cache/page

It must be that doing an IO request has a fairly high overhead and so

For merging in particular, with mergeFactor=10, I can see that a

And some good news: these gains seem to be additive to the gains in
John Haxby (migrated from JIRA)

> Net/net it's between 10-18% performance gain overall. It is

Actually, it's not that surprising. Linux and BSD (MacOS) kernels work hard to do good I/O without the user having to do much to take it into account. The improvement you're seeing on those systems has as much to do with the fact that you're dealing with complete file system block sizes (4x4k) and complete VM page sizes (4x4k). You'd probably see similar gains just going from 1k to 4k, though: even "cp" benefits from using a 4k block size rather than 1k. I'd guess that a 4k or 8k buffer would be best on Linux/MacOS and that you wouldn't see much difference going to 16k. In fact, in the MacOS tests the big jump seems to be from 1k to 4k, with smaller improvements thereafter.

I'm not that surprised by the WinXP changes: the I/O subsystem on a laptop is usually dire, and anything that cuts down on the I/O is going to be a big help. I would expect the difference to be more dramatic with a FAT32 file system than with NTFS, though.
Michael McCandless (@mikemccand) (migrated from JIRA)

Attached the patch based on the above discussion.
Michael Busch (migrated from JIRA)

Mike, I tested and reviewed your patch. It looks good and all tests pass! Two comments:

- Should we increase the buffer size for CompoundFileReader to 4KB
- In BufferedIndexInput.setBufferSize() a new buffer should only be
Michael McCandless (@mikemccand) (migrated from JIRA)

> I tested and reviewed your patch. It looks good and all tests pass!

Thanks!

> Should we increase the buffer size for CompoundFileReader to 4KB

I'm a little nervous about that: I don't know the impact it will have

Hmmm, actually, a CSIndexInput potentially goes through 2 buffers when

It almost seems like the double copy would not occur because

> In BufferedIndexInput.setBufferSize() a new buffer should only be

Ahh, good. Will do.
Michael Busch (migrated from JIRA)

> I'm a little nervous about that: I don't know the impact it will have

Doesn't the OS always read at least a whole block from the disk (usually

> Hmmm, actually, a CSIndexInput potentially goes through 2 buffers when

Good catch! Reminds me a bit of #1509 where we also did double
Michael McCandless (@mikemccand) (migrated from JIRA)

> > I'm a little nervous about that: I don't know the impact it will have

Yes, I think you're right. But we should test search
Doug Cutting (@cutting) (migrated from JIRA)

> then we don't save IO by limiting the buffer size to 1 KB

I'm confused by this. My assumption is that, when you make a request to read 1k from a disk file, the OS reads substantially more than 1k from the disk and places it in the buffer cache. (The cost of randomly reading 1k is nearly the same as randomly reading 100k; both are dominated by seek.) So, if you make another request to read 1k shortly thereafter, you'll get it from the buffer cache and the incremental cost will be that of making a system call. In general, one should thus rely on the buffer cache and read-ahead, and make input buffers only big enough that system call overhead is insignificant.

An alternate strategy is to not trust the buffer cache and read-ahead, but rather to make your buffers large enough that transfer time dominates seeks. This can require 1MB or larger buffers, so it isn't always practical.

So, back to your statement: a 1k buffer doesn't save physical I/O, but nor should it incur extra physical I/O. It does incur extra system calls, but uses less memory, which is a tradeoff. Is that what you meant?
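Doug's tradeoff, that larger buffers save system calls rather than physical I/O when the file is in the buffer cache, can be made concrete by counting read() calls for different buffer sizes. This is a rough, self-contained sketch (class and file names are arbitrary, not from the patch); wall-clock timing is deliberately left out since it varies by machine and cache state.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Illustrative sketch: consuming the same file with a 1 KB vs a 16 KB
 * buffer needs roughly 16x fewer read() calls, i.e. the larger buffer
 * amortizes per-call overhead without changing how many bytes move.
 */
public class BufferSizeCalls {
    /** Count the read() calls needed to consume the whole file. */
    static long countReads(Path file, int bufferSize) throws IOException {
        long calls = 0;
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[bufferSize];
            while (in.read(buf) != -1) {
                calls++;  // one system-call-backed read per iteration
            }
        }
        return calls;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("lucene888", ".bin");
        Files.write(file, new byte[1 << 20]);  // 1 MB of zeros
        System.out.println("1 KB buffer:  " + countReads(file, 1024) + " calls");
        System.out.println("16 KB buffer: " + countReads(file, 16 * 1024) + " calls");
        Files.delete(file);
    }
}
```

Note that InputStream.read(byte[]) may return short reads, so the counts are lower bounds of file size divided by buffer size; on a local file system they typically land exactly there.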
robert engels (migrated from JIRA)

I think the important consideration is how expensive the system call is. Since the system call requires JNI, it MIGHT be expensive.

Another important consideration is buffer utilization. It is my understanding that the OS will normally perform read-ahead only for sequential access, beyond the additional bytes read to optimize the physical read. If Lucene performs indexed reads but the data is actually being accessed sequentially, Lucene managing its own buffers can be far more effective.

Along these lines, if the server is heavily used for a variety of applications, Lucene can manage its own buffers more efficiently, similar to how a database almost always (every commercial one I know of) has its own buffer cache and does not rely on the OS.

In a GC environment it may be even more important for the buffers to be managed/reused from a pool, as you avoid the GC overhead.

Just my thoughts.
Michael Busch (migrated from JIRA)

> So, back to your statement, a 1k buffer doesn't save

Yes, I agree.

> It does incur extra system calls, but uses less memory,

A coworker told me that he ran some experiments with buffer

So yes, it is a tradeoff between memory usage and amount

But I'm just sort of guessing here, I think we should run
Michael Busch (migrated from JIRA)

Mike, another thing I just noticed is your new testcase doesn't remove
Eks Dev (migrated from JIRA)

We did a few tests some time ago and simply concluded that it boils down to what Doug said: "It does incur extra system calls, but uses less memory, which is a tradeoff." In our setup 4k was kind of a magic number, ca. 5-8% faster. I guess it is actually a compromise between time spent in extra OS calls vs. the probability of reading more than you will really use (wasting time on it). Why 4k often happens to be a good compromise is probably that the difference in speed of buffer copy for 4k vs. 1k is negligible compared to time spent on system calls. The only conclusion we came to is that you have to measure it and find a good compromise.

Our case is a bit atypical: short documents (1G index, 60 million docs) and queries with a lot of terms (80-200), Win 2003 Server, single disk. And I do not remember whether it was before or after we started using MMAP, so no idea really whether this affects an MMAP setup; I guess not.
Michael McCandless (@mikemccand) (migrated from JIRA)

> another thing I just noticed is your new testcase doesn't remove the

Whoops, I will fix. I will also fix it to appear under the tempDir.
Michael McCandless (@mikemccand) (migrated from JIRA)

> we did some time ago a few tests and simply concluded, it boils down to what Doug said, "It does incur extra system calls, but uses less memory, which is a tradeoff."

Interesting! Do you remember if your 5-8% gain was for searching or
Michael McCandless (@mikemccand) (migrated from JIRA)

New patch with these changes:
Michael Busch (migrated from JIRA)

Take2 looks good. +1 for committing.
Marvin Humphrey (migrated from JIRA)

I have some auxiliary data points to report after experimenting with buffer

The FS i/o classes in KinoSearch use a FILE* and

So, it seems that the only change I can make moves the numbers in the wrong

The results are somewhat puzzling because I would ordinarily have blamed
Michael McCandless (@mikemccand) (migrated from JIRA)

Marvin, it's odd that you see no gains when coming straight from C.

I wonder if IO calls from Java somehow have a sizable overhead that

Also, how much "merging" is actually done in your test / KS? How many
Marvin Humphrey (migrated from JIRA)

> Also, how much "merging" is actually done in your test / KS? How many

In the previous test, I was indexing 1000 Reuters articles all in one session.

I reran the test on the FreeBSD box, changing it so that the index was built
Michael McCandless (@mikemccand) (migrated from JIRA)

I re-ran the "second test" above, but this time with compound file

Baseline (trunk) = 1 K buffers for all 3. New = 16 K for

- Quad core Mac OS X on 4-drive RAID 0
- Dual core Debian Linux (2.6.18 kernel) on 6 drive RAID 5
- Windows XP Pro laptop, single drive

Quick observations:

OK I plan to commit this soon.
In working on #1918, I noticed that two buffer sizes have a
substantial impact on overall indexing performance.
First is BufferedIndexOutput.BUFFER_SIZE (also used by
BufferedIndexInput). Second is CompoundFileWriter's buffer used to
actually build the compound file. Both are now 1 KB (1024 bytes).
I ran the same indexing test I'm using for #1918. I'm indexing
~5,500 byte plain text docs derived from the Europarl corpus
(English). I index 200,000 docs with compound file enabled and term
vector positions & offsets stored plus stored fields. I flush
documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
not hit #1920. The resulting index is 1.7 GB. The index is not
optimized at the end, and I left mergeFactor at 10. I ran the tests on a
quad-core OS X 10 machine with a 4-drive RAID 0 IO system.
At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
I increase both buffers to 8 KB it takes 554 sec to build the index,
which is an 11% overall gain!
I will run more tests to see if there is a natural knee in the curve
(buffer size above which we don't really gain much more performance).
I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
at 1024, at least for now. During searching there can be quite a few
of this class instantiated, and likely a larger buffer size for the
freq/prox streams could actually hurt search performance for those
searches that use skipping.
The CompoundFileWriter buffer is created only briefly, so I think we
can use a fairly large (32 KB?) buffer there. And there should not be
too many BufferedIndexOutputs alive at once so I think a large-ish
buffer (16 KB?) should be OK.
Migrated from LUCENE-888 by Michael McCandless (@mikemccand), 2 votes, resolved May 29 2007
Attachments: LUCENE-888.patch, LUCENE-888.take2.patch