
Improve indexing performance by increasing internal buffer sizes [LUCENE-888] #1963

Closed
asfimport opened this issue May 23, 2007 · 31 comments


asfimport commented May 23, 2007

In working on #1918, I noticed that two buffer sizes have a
substantial impact on overall indexing performance.

First is BufferedIndexOutput.BUFFER_SIZE (also used by
BufferedIndexInput). Second is CompoundFileWriter's buffer used to
actually build the compound file. Both are now 1 KB (1024 bytes).

I ran the same indexing test I'm using for #1918. I'm indexing
~5,500 byte plain text docs derived from the Europarl corpus
(English). I index 200,000 docs with compound file enabled and term
vector positions & offsets stored plus stored fields. I flush
documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
not hit #1920. The resulting index is 1.7 GB. The index is not
optimized at the end, and I left mergeFactor at 10.

I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
system.

At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
I increase both buffers to 8 KB it takes 554 sec to build the index,
which is an 11% overall gain!

I will run more tests to see if there is a natural knee in the curve
(buffer size above which we don't really gain much more performance).

I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
at 1024, at least for now. During searching there can be quite a few
of this class instantiated, and likely a larger buffer size for the
freq/prox streams could actually hurt search performance for those
searches that use skipping.

The CompoundFileWriter buffer is created only briefly, so I think we
can use a fairly large (32 KB?) buffer there. And there should not be
too many BufferedIndexOutputs alive at once so I think a large-ish
buffer (16 KB?) should be OK.


Migrated from LUCENE-888 by Michael McCandless (@mikemccand), 2 votes, resolved May 29 2007
Attachments: LUCENE-888.patch, LUCENE-888.take2.patch


asfimport commented May 24, 2007

Michael Busch (migrated from JIRA)

> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!

Cool!

> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now. During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.

I submitted a patch for #1508 which avoids copying the buffer when
a BufferedIndexInput is cloned. With this patch we could also add a
method setBufferSize(int) to BufferedIndexInput. This method would then
have to be called right after the input has been created or cloned and
before the first read is performed (the first read operation allocates
the buffer). If called later it wouldn't have any effect. This would
allow us to adjust the buffer size dynamically, e.g. use large buffers
for segment merges and stored fields, but smaller ones for freq/prox
streams, maybe dependent on the document frequency.
What do you think?
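The proposed lifecycle can be sketched with a tiny self-contained class (the names LazyBufferedInput, setBufferSize and readByte are illustrative only, not the real Lucene API): because the buffer is allocated lazily, setBufferSize only has an effect before the first read.

```java
// Hypothetical sketch, not the actual Lucene classes: a buffered input
// that allocates its buffer lazily on the first read, as described above.
class LazyBufferedInput {
    public static final int DEFAULT_BUFFER_SIZE = 1024;

    private byte[] buffer;                       // allocated on first read
    private int bufferSize = DEFAULT_BUFFER_SIZE;

    /** Takes effect only before the first read allocates the buffer. */
    void setBufferSize(int size) {
        if (buffer == null) {
            bufferSize = size;
        }
        // If the buffer already exists the call is silently ignored;
        // exactly the surprising API caveat discussed in this thread.
    }

    int readByte() {
        if (buffer == null) {
            buffer = new byte[bufferSize];       // lazy allocation
            buffer[0] = 42;                      // stand-in for a real refill
        }
        return buffer[0];
    }

    int currentBufferSize() {
        return buffer == null ? bufferSize : buffer.length;
    }
}
```

A caller that forgets to set the size before the first read silently keeps the 1 KB default, which is why the thread goes on to look for a cleaner API.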

> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there. And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.

I'm wondering how much performance benefits from increasing the buffer
size beyond the file system's page size. Does it make a big difference
whether you use 8 KB, 16 KB or 32 KB? If the answer is yes, then I think
the numbers you propose are good.


asfimport commented May 24, 2007

Michael McCandless (@mikemccand) (migrated from JIRA)

> > I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> > at 1024, at least for now. During searching there can be quite a few
> > of this class instantiated, and likely a larger buffer size for the
> > freq/prox streams could actually hurt search performance for those
> > searches that use skipping.
>
> I submitted a patch for #1508 which avoids copying the buffer when
> a BufferedIndexInput is cloned. With this patch we could also add a
> method setBufferSize(int) to BufferedIndexInput. This method would then
> have to be called right after the input has been created or cloned and
> before the first read is performed (the first read operation allocates
> the buffer). If called later it wouldn't have any effect. This would
> allow us to adjust the buffer size dynamically, e.g. use large buffers
> for segment merges and stored fields, but smaller ones for freq/prox
> streams, maybe dependent on the document frequency.
> What do you think?

I like that idea!

I am actually seeing that increased buffer sizes for
BufferedIndexInput help performance of indexing as well (up to ~5%
just changing this buffer), so I think we do want to increase this but
only for merging.

I wonder if we should just add a ctor to BufferedIndexInput that takes
the bufferSize? This would avoid the surprising API caveat you
describe above. The problem is, then all classes (SegmentTermDocs,
SegmentTermPositions, FieldsReader, etc.) that open an IndexInput
would also have to have ctors to change buffer sizes. Even if we do
setBufferSize instead of new ctor we have some cases (eg at least
SegmentTermEnum) where bytes are read during construction so it's too
late for caller to then change buffer size. Hmmm. Not clear how to
do this cleanly...

Maybe we do the setBufferSize approach, but, if the buffer already
exists, rather than throwing an exception we check if the new size is
greater than the old size and if so we grow the buffer? I can code this
up.

> > The CompoundFileWriter buffer is created only briefly, so I think we
> > can use a fairly large (32 KB?) buffer there. And there should not be
> > too many BufferedIndexOutputs alive at once so I think a large-ish
> > buffer (16 KB?) should be OK.
>
> I'm wondering how much performance benefits if you increase the buffer
> size beyond the file system's page size? Does it make a big difference
> if you use 8 KB, 16 KB or 32 KB? If the answer is yes, then I think
> the numbers you propose are good.

I'm testing now different sizes of each of these three buffers
... will post the results.


Michael Busch (migrated from JIRA)

> I wonder if we should just add a ctor to BufferedIndexInput that takes
> the bufferSize? This would avoid the surprising API caveat you
> describe above. The problem is, then all classes (SegmentTermDocs,
> SegmentTermPositions, FieldsReader, etc.) that open an IndexInput
> would also have to have ctors to change buffer sizes. Even if we do
> setBufferSize instead of new ctor we have some cases (eg at least
> SegmentTermEnum) where bytes are read during construction so it's too
> late for caller to then change buffer size. Hmmm. Not clear how to
> do this cleanly...

Yeah I was thinking about the ctor approach as well. Actually
BufferedIndexInput does not have a public ctor so far, it's created by
using Directory.openInput(String fileName). And to add a new ctor would
mean an API change, so subclasses wouldn't compile anymore without
changes.
What we might want to do instead is add a new method
openInput(String fileName, int bufferSize) to Directory which calls
the existing openInput(String fileName) by default, so subclasses of
Directory would ignore the bufferSize parameter by default. Then we
can change FSDirectory to override openInput(String, int):

public IndexInput openInput(String name, int bufferSize)
    throws IOException {
  FSIndexInput input = new FSIndexInput(new File(directory, name));
  input.setBufferSize(bufferSize);
  return input;
}

This should solve the problems you mentioned, like in SegmentTermEnum,
and we don't have to support setBufferSize() after a read has been
performed. It also has the advantage that we save an instanceof check
and a cast from IndexInput to BufferedIndexInput before setBufferSize()
can be called.

After a clone however, we would still have to cast to
BufferedIndexInput before setBufferSize() can be called.
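The Directory side of this proposal can be sketched as follows (a simplified, self-contained sketch: the real methods also throw IOException, and the real Directory and IndexInput classes have many more members). The two-argument openInput delegates to the existing one-argument method by default, so existing Directory subclasses keep compiling and simply ignore bufferSize until they choose to override it.

```java
abstract class IndexInput {}   // stand-in for the real class

abstract class Directory {
    /** Existing API: subclasses must implement this. */
    public abstract IndexInput openInput(String name);

    /** Proposed addition: the default implementation ignores bufferSize,
     *  so existing Directory subclasses are unaffected by the new method. */
    public IndexInput openInput(String name, int bufferSize) {
        return openInput(name);
    }
}
```

FSDirectory would then override the two-argument version to pass the size through to its FSIndexInput, as shown above.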


asfimport commented May 24, 2007

Michael McCandless (@mikemccand) (migrated from JIRA)

> > I wonder if we should just add a ctor to BufferedIndexInput that takes
> > the bufferSize? This would avoid the surprising API caveat you
> > describe above. The problem is, then all classes (SegmentTermDocs,
> > SegmentTermPositions, FieldsReader, etc.) that open an IndexInput
> > would also have to have ctors to change buffer sizes. Even if we do
> > setBufferSize instead of new ctor we have some cases (eg at least
> > SegmentTermEnum) where bytes are read during construction so it's too
> > late for caller to then change buffer size. Hmmm. Not clear how to
> > do this cleanly...
>
> Yeah I was thinking about the ctor approach as well. Actually
> BufferedIndexInput does not have a public ctor so far, it's created by
> using Directory.openInput(String fileName). And to add a new ctor would
> mean an API change, so subclasses wouldn't compile anymore without
> changes.

Actually, it does have a default public constructor right? Ie if we add

public BufferedIndexInput()
public BufferedIndexInput(int bufferSize)

then I think we don't break API backwards compatibility?
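A minimal sketch of why the overloaded constructor is backward compatible (illustrative names, not the actual Lucene classes): a subclass with no explicit constructor implicitly calls the no-arg super(), so it keeps compiling and behaving the same after the overload is added.

```java
class BufferedInputSketch {
    public static final int DEFAULT_BUFFER_SIZE = 1024;
    final int bufferSize;

    // Existing behavior, preserved as the no-arg form.
    public BufferedInputSketch() { this(DEFAULT_BUFFER_SIZE); }

    // New overload: callers may pick a buffer size explicitly.
    public BufferedInputSketch(int bufferSize) {
        this.bufferSize = bufferSize;
    }
}

/** A pre-existing subclass with no explicit constructor: it implicitly
 *  invokes the no-arg super() and is untouched by the new overload. */
class LegacySubclass extends BufferedInputSketch {}
```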

> After a clone however, we would still have to cast to
> BufferedIndexInput before setBufferSize() can be called.

I plan to add "private int bufferSize" to BufferedIndexInput,
defaulting to BUFFER_SIZE. I think then it would just work w/ your
#1508 patch because your patch sets the clone's buffer to null
and then when the clone allocates its buffer it will be length
bufferSize. I think?


asfimport commented May 24, 2007

Michael Busch (migrated from JIRA)

> Actually, it does have a default public constructor right? Ie if we add

> public BufferedIndexInput()
> public BufferedIndexInput(int bufferSize)

> then I think we don't break API backwards compatibility?

Oops! Of course, you are right. What was I thinking...

> I plan to add "private int bufferSize" to BufferedIndexInput,
> defaulting to BUFFER_SIZE. I think then it would just work w/ your
> #1508 patch because your patch sets the clone's buffer to null
> and then when the clone allocates its buffer it will be length
> bufferSize. I think?

True. But it would be nice if it was possible to change the buffer size
after a clone. For example in SegmentTermDocs we could then adjust the
buffer size of the cloned freqStream according to the document frequency.
And in my multi-level skipping patch (#1941) I could also benefit
from this functionality.

Hmm, in SegmentTermDocs the freq stream is cloned in the ctor. If the
same instance of SegmentTermDocs is used for different terms, then
the same clone is used. So actually it would be nice if it was possible
to change the buffer size after a read has been performed.

> Maybe we do the setBufferSize approach, but, if the buffer already
> exists, rather than throwing an exception we check if the new size is
> greater than the old size and if so we grow the buffer? I can code this
> up.

So yes, I think we should implement it this way.


asfimport commented May 24, 2007

Michael McCandless (@mikemccand) (migrated from JIRA)

> > I plan to add "private int bufferSize" to BufferedIndexInput,
> > defaulting to BUFFER_SIZE. I think then it would just work w/ your
> > #1508 patch because your patch sets the clone's buffer to null
> > and then when the clone allocates its buffer it will be length
> > bufferSize. I think?
>
> True. But it would be nice if it was possible to change the buffer size
> after a clone. For example in SegmentTermDocs we could then adjust the
> buffer size of the cloned freqStream according to the document frequency.
> And in my multi-level skipping patch (#1941) I could also benefit
> from this functionality.

OK, I agree: let's add a BufferedIndexInput.setBufferSize() and also
openInput(path, bufferSize) to Directory base class & to FSDirectory.

> Hmm, in SegmentTermDocs the freq stream is cloned in the ctor. If the
> same instance of SegmentTermDocs is used for different terms, then
> the same clone is used. So actually it would be nice if it was possible
> to change the buffer size after a read has been performed.
>
> > Maybe we do the setBufferSize approach, but, if the buffer already
> > exists, rather than throwing an exception we check if the new size is
> > greater than the old size and if so we grow the buffer? I can code this
> > up.
>
> So yes, I think we should implement it this way.

OK I will do this. Actually, I think we should also allow making the
buffer smaller this way. Meaning, I will preserve buffer contents
(starting from bufferPosition) as much as is allowed by the smaller
buffer. This way there is no restriction on using this method
vs. having read bytes already ("principle of least surprise").
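That behavior could look roughly like this (a sketch using the field names from the discussion, not the actual patch): the resize keeps whatever unread bytes fit into the new buffer, starting from bufferPosition.

```java
// Sketch of a setBufferSize that can grow or shrink an already-allocated
// buffer while preserving the bytes that have not been consumed yet.
class ResizableBuffer {
    byte[] buffer;          // may be null until the first read
    int bufferLength;       // number of valid bytes in buffer
    int bufferPosition;     // index of the next byte to be consumed

    void setBufferSize(int newSize) {
        if (buffer == null || buffer.length == newSize) {
            // Nothing allocated yet (a real implementation would remember
            // newSize for the first read), or the size is unchanged.
            return;
        }
        byte[] newBuffer = new byte[newSize];
        // Keep as many unread bytes (from bufferPosition on) as fit.
        int leftover = Math.min(bufferLength - bufferPosition, newSize);
        System.arraycopy(buffer, bufferPosition, newBuffer, 0, leftover);
        buffer = newBuffer;
        bufferPosition = 0;
        bufferLength = leftover;
    }
}
```

Because unread content survives the resize, callers may invoke it at any time, which is the "principle of least surprise" mentioned above.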


asfimport commented May 24, 2007

Michael Busch (migrated from JIRA)

> OK I will do this. Actually, I think we should also allow making the
> buffer smaller this way. Meaning, I will preserve buffer contents
> (starting from bufferPosition) as much as is allowed by the smaller
> buffer. This way there is no restriction on using this method
> vs. having read bytes already ("principle of least surprise").

Yep sounds good. I can code this and commit it with #1508.
If you want to do it, then I should commit the existing 430 patch
soon, so that there are no conflicts in our patches?


asfimport commented May 24, 2007

Michael McCandless (@mikemccand) (migrated from JIRA)

> > OK I will do this. Actually, I think we should also allow making the
> > buffer smaller this way. Meaning, I will preserve buffer contents
> > (starting from bufferPosition) as much as is allowed by the smaller
> > buffer. This way there is no restriction on using this method
> > vs. having read bytes already ("principle of least surprise").
>
> Yep sounds good. I can code this and commit it with #1508.
> If you want to do it, then I should commit the existing 430 patch
> soon, so that there are no conflicts in our patches?

I'm actually coding it up now. Why don't you commit #1508
soonish and then I'll update & merge?


Marvin Humphrey (migrated from JIRA)

I would like to know why these gains are appearing, and how specific they are to a particular system. How can the optimum buffer size be deduced? Is it a factor of hard disk sector size? Memory page size? Lucene write behavior pattern? Level X Cache size?


asfimport commented May 24, 2007

Michael Busch (migrated from JIRA)

> I'm actually coding it up now. Why don't you commit #1508
> soonish and then I'll update & merge?

Done.


Michael McCandless (@mikemccand) (migrated from JIRA)

OK I ran two sets of tests. First is only on Mac OS X to see how
performance changes with buffer sizes. Second was also on Debian
Linux & Windows XP Pro.

The performance gains are 10-18% faster overall.

FIRST TEST

I increased buffer sizes, separately, for each of BufferedIndexInput,
BufferedIndexOutput and CompoundFileWriter. Each test is run once on
Mac OS X:

BufferedIndexInput

  1 K   622 sec (current trunk)
  4 K   607 sec
  8 K   606 sec
 16 K   598 sec
 32 K   606 sec
 64 K   589 sec
128 K   601 sec

CompoundFileWriter

  1 K   622 sec (current trunk)
  4 K   599 sec
  8 K   591 sec
 16 K   578 sec
 32 K   583 sec
 64 K   580 sec

BufferedIndexOutput

  1 K   622 sec (current trunk)
  4 K   588 sec
  8 K   576 sec
 16 K   551 sec
 32 K   566 sec
 64 K   555 sec
128 K   543 sec
256 K   534 sec
512 K   564 sec

Comments:

  • The results are fairly noisy, but performance does generally get
    better w/ larger buffers.

  • BufferedIndexOutput seems specifically to like very large output
    buffers; the other two show a smaller but still significant
    effect.

Given this I picked 16 K buffer for BufferedIndexOutput, 16 K buffer
for CompoundFileWriter and 4 K buffer for BufferedIndexInput. I think
we would get faster performance for a larger buffer for
BufferedIndexInput, but, even when merging there are quite a few of
these created (mergeFactor * N where N = number of separate index
files).

Then, I re-tested the baseline (trunk) & these buffer sizes across
platforms (below):

SECOND TEST

Baseline (trunk) = 1 K buffers for all 3. New = 16 K for
BufferedIndexOutput, 16 K for CompoundFileWriter and 4 K for
BufferedIndexInput.

I ran each test 4 times & took the best time:

Quad core Mac OS X on 4-drive RAID 0
baseline 622 sec
new 527 sec
-> 15% faster

Dual core Debian Linux (2.6.18 kernel) on 6 drive RAID 5
baseline 708 sec
new 635 sec
-> 10% faster

Windows XP Pro laptop, single drive
baseline 1604 sec
new 1308 sec
-> 18% faster

Net/net it's between 10-18% performance gain overall. It is
interesting that the system with the "weakest" IO system (one drive on
Windows XP vs RAID 0/5 on the others) has the best gains.


asfimport commented May 25, 2007

Michael McCandless (@mikemccand) (migrated from JIRA)

> I would like to know why these gains are appearing, and how specific
> they are to a particular system. How can the optimum buffer size be
> deduced? Is it a factor of hard disk sector size? Memory page size?
> Lucene write behavior pattern? Level X Cache size?

It looks like the gains are cross platform (at least between OS X,
Linux, Windows XP) and cross-IO architecture.

I'm not sure how this depends/correlates to the various cache/page
sizes through the layers of OS -> disk heads.

It must be that doing an IO request has a fairly high overhead and so
the more bytes you can read/write at once the faster it is, since you
amortize that overhead.

For merging in particular, with mergeFactor=10, I can see that a
larger buffer size on the input streams should help reduce insane
seeks back & forth between the 10 files (and the 1 output file).
Maybe larger reads on the input streams also cause OS's IO scheduler
to do larger read-ahead in anticipation?
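The amortization effect is easy to demonstrate without Lucene at all. The following self-contained snippet (illustrative, not from the patch) counts how many underlying bulk reads are needed to drain a 64 KB stream through java.io.BufferedInputStream: a 1 KB buffer issues roughly eight times as many underlying reads as an 8 KB buffer, and on a real file each of those would be a system call.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class ReadCounter {

    /** Wraps a stream and counts how often read(byte[],int,int) is invoked. */
    static class CountingInput extends FilterInputStream {
        int calls = 0;
        CountingInput(InputStream in) { super(in); }
        @Override public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return super.read(b, off, len);
        }
    }

    /** Drains a 64 KB stream byte-by-byte through a buffer of the given
     *  size and returns the number of underlying bulk reads needed. */
    static int underlyingReads(int bufferSize) {
        try {
            CountingInput counting =
                new CountingInput(new ByteArrayInputStream(new byte[64 * 1024]));
            BufferedInputStream buffered =
                new BufferedInputStream(counting, bufferSize);
            while (buffered.read() != -1) { /* consume one byte at a time */ }
            return counting.calls;
        } catch (IOException e) {
            throw new RuntimeException(e);  // cannot happen for in-memory data
        }
    }
}
```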

And some good news: these gains seem to be additive to the gains in
#1918, at least with my initial testing.


John Haxby (migrated from JIRA)

> Net/net it's between 10-18% performance gain overall. It is
> interesting that the system with the "weakest" IO system (one drive on
> Windows XP vs RAID 0/5 on the others) has the best gains.

Actually, it's not that surprising. Linux and BSD (MacOS) kernels work hard to do good I/O without the user having to do much to take it into account. The improvement you're seeing on those systems is as much to do with the fact that you're dealing with complete file system block sizes (4x4k) and complete VM page sizes (4x4k). You'd probably see similar gains just going from 1k to 4k though: even "cp" benefits from using a 4k block size rather than 1k. I'd guess that a 4k or 8k buffer would be best on Linux/MacOS and that you wouldn't see much difference going to 16k. In fact, in the MacOS tests the big jump seems to be from 1k to 4k with smaller improvements thereafter.

I'm not that surprised by the WinXP changes: the I/O subsystem on a laptop is usually dire and anything that will cut down on the I/O is going to be a big help. I would expect that the difference would be more dramatic with a FAT32 file system than it would be with NTFS though.


Michael McCandless (@mikemccand) (migrated from JIRA)

Attached the patch based on above discussion.


Michael Busch (migrated from JIRA)

Mike,

I tested and reviewed your patch. It looks good and all tests pass! Two comments:

  • Should we increase the buffer size for CompoundFileReader to 4KB not only for the merge mode but also for the normal read mode?
  • In BufferedIndexInput.setBufferSize() a new buffer should only be allocated if the new size is different from the previous buffer size.


Michael McCandless (@mikemccand) (migrated from JIRA)

> I tested and reviewed your patch. It looks good and all tests pass!

Thanks!

> - Should we increase the buffer size for CompoundFileReader to 4KB
> not only for the merge mode but also for the normal read mode?

I'm a little nervous about that: I don't know the impact it will have
on searching, especially queries that heavily use skipping?

Hmmm, actually, a CSIndexInput potentially goes through 2 buffers when
it does a read – its own (since each CSIndexInput subclasses from
BufferedIndexInput) and then the main stream of the
CompoundFileReader. It seems like we shouldn't do this? We should
not do a double copy.

It almost seems like the double copy would not occur because
readBytes() has logic to read directly from the underlying stream if
the size is >= bufferSize. However, I see at least one case during
merging where this logic doesn't kick in (and we do a double buffer
copy) because the buffers become "skewed". I will open a separate
issue for this; I think we should fix it and gain some more
performance.

> In BufferedIndexInput.setBufferSize() a new buffer should only be
> allocated if the new size is different from the previous buffer
> size.

Ahh, good. Will do.


asfimport commented May 25, 2007

Michael Busch (migrated from JIRA)

> I'm a little nervous about that: I don't know the impact it will have
> on searching, especially queries that heavily use skipping?

Doesn't the OS always read at least a whole block from the disk (usually
4 KB)? If the answer is yes, then we don't save IO by limiting the buffer
size to 1 KB, right? Of course it makes sense to limit the size for
streams that we clone often (like the freq stream) to save memory and
array copies. But we never clone the base stream in the cfs reader.

> Hmmm, actually, a CSIndexInput potentially goes through 2 buffers when
> it does a read – its own (since each CSIndexInput subclasses from
> BufferedIndexInput) and then the main stream of the
> CompoundFileReader. It seems like we shouldn't do this? We should
> not do a double copy.
>
> It almost seems like the double copy would not occur because
> readBytes() has logic to read directly from the underlying stream if
> the size is >= bufferSize. However, I see at least one case during
> merging where this logic doesn't kick in (and we do a double buffer
> copy) because the buffers become "skewed". I will open a separate
> issue for this; I think we should fix it and gain some more
> performance.

Good catch! Reminds me a bit of #1509 where we also did double
buffering for the RAMInputStream and RAMOutputStream. Yes, I think
we should fix this.


Michael McCandless (@mikemccand) (migrated from JIRA)

> > I'm a little nervous about that: I don't know the impact it will have
> > on searching, especially queries that heavily use skipping?
>
> Doesn't the OS always read at least a whole block from the disk
> (usually 4 KB)? If the answer is yes, then we don't save IO by
> limiting the buffer size to 1 KB, right? Of course it makes sense to
> limit the size for streams that we clone often (like the freq
> stream) to save memory and array copies. But we never clone the base
> stream in the cfs reader.

Yes, I think you're right. But we should test search
performance on a large index & "typical" queries to be sure? I will
open a new issue to track this...


Doug Cutting (@cutting) (migrated from JIRA)

> then we don't save IO by limiting the buffer size to 1 KB

I'm confused by this. My assumption is that, when you make a request to read 1k from a disk file, the OS reads substantially more than 1k from the disk and places it in the buffer cache. (The cost of randomly reading 1k is nearly the same as randomly reading 100k; both are dominated by seek.) So, if you make another request to read 1k shortly thereafter, you'll get it from the buffer cache and the incremental cost will be that of making a system call.

In general, one should thus rely on the buffer cache and read-ahead, and make input buffers only big enough so that system call overhead is insignificant. An alternate strategy is to not trust the buffer cache and read-ahead, but rather to make your buffers large enough so that transfer time dominates seeks. This can require 1MB or larger buffers, so isn't always practical.

So, back to your statement, a 1k buffer doesn't save physical i/o, but nor should it incur extra physical i/o. It does incur extra system calls, but uses less memory, which is a tradeoff. Is that what you meant?


robert engels (migrated from JIRA)

I think the important consideration is how expensive the system call is. Since the system call requires JNI, it MIGHT be expensive.

Another important consideration is buffer utilization. It is my understanding that the OS will normally perform read-ahead only for sequential access, beyond the additional bytes read to optimize the physical read. If Lucene performs indexed reads but the data is actually being accessed sequentially, Lucene managing its own buffers can be far more effective.

Along these lines, if the server is heavily used for a variety of applications, Lucene can manage its own buffers more efficiently - similar to how a database almost always (every commercial one I know of) has its own buffer cache and does not rely on the OS. In a GC environment, though, it may be even more important if the buffers were managed/reused from a pool, as you avoid the GC overhead.

Just my thoughts.


asfimport commented May 25, 2007

Michael Busch (migrated from JIRA)

> So, back to your statement, a 1k buffer doesn't save
> physical i/o, but nor should it incur extra physical i/o.

Yes, I agree.

> It does incur extra system calls, but uses less memory,
> which is a tradeoff.

A coworker told me that he ran some experiments with buffer
sizes in a different application, and he found out that
increasing the buffer size beyond the disk's block size
sped things up further. In these tests he read a whole
file sequentially, which means that the speedup came from
saving system calls, right?

So yes, it is a tradeoff between memory usage and amount
of system calls. That's the reason for my suggestion to
only increase the buffer size for input streams that we
don't clone, like the base stream in the compound reader.

But I'm just sort of guessing here, I think we should run
some performance experiments. Mike already opened
#1968 for that.


Michael Busch (migrated from JIRA)

Mike,

another thing I just noticed is your new testcase doesn't remove
the directory "testSetBufferSize" at the end.


Eks Dev (migrated from JIRA)

Some time ago we did a few tests and simply concluded that it boils down to what Doug said: "It does incur extra system calls, but uses less memory, which is a tradeoff."

In our setup 4k was kind of a magic number, ca. 5-8% faster. I guess it is actually a compromise between time spent in extra OS calls vs. the probability of reading more than you will really use (wasting time on it). Why 4k often happens to be a good compromise is probably that the difference in speed of a buffer copy for 4k vs. 1k is negligible compared to time spent on system calls.

The only conclusion we came up with is that you have to measure it and find a good compromise. Our case is a bit atypical: short documents (1 GB index, 60 million docs) and queries with a lot of terms (80-200), Win 2003 Server, single disk.

And I do not remember whether it was before or after we started using MMAP, so no idea really whether this affects an MMAP setup; I guess not.


Michael McCandless (@mikemccand) (migrated from JIRA)

> another thing I just noticed is your new testcase doesn't remove the
> directory "testSetBufferSize" at the end.

Woops, I will fix. I will also fix it to appear under the tempDir.


asfimport commented May 25, 2007

Michael McCandless (@mikemccand) (migrated from JIRA)

> Some time ago we did a few tests and simply concluded that it boils down to what Doug said: "It does incur extra system calls, but uses less memory, which is a tradeoff."
>
> In our setup 4k was kind of a magic number, ca. 5-8% faster. I guess it is actually a compromise between time spent in extra OS calls vs. the probability of reading more than you will really use (wasting time on it). Why 4k often happens to be a good compromise is probably that the difference in speed of a buffer copy for 4k vs. 1k is negligible compared to time spent on system calls.
>
> The only conclusion we came up with is that you have to measure it and find a good compromise. Our case is a bit atypical: short documents (1 GB index, 60 million docs) and queries with a lot of terms (80-200), Win 2003 Server, single disk.
>
> And I do not remember whether it was before or after we started using MMAP, so no idea really whether this affects an MMAP setup; I guess not.

Interesting! Do you remember if your 5-8% gain was for searching or
indexing? This issue is focusing on buffer size impact on indexing
performance and #1968 is focusing on search performance.


Michael McCandless (@mikemccand) (migrated from JIRA)

New patch with these changes:

  • The unit test now creates its test index under "tempDir" & removes it when done

  • Don't allocate a new buffer in setBufferSize if the newSize == the current size


Michael Busch (migrated from JIRA)

Take2 looks good. +1 for committing.


Marvin Humphrey (migrated from JIRA)

I have some auxiliary data points to report after experimenting with buffer
size in KS today on three different systems: OS X 10.4.9, FreeBSD 5.3, and an
old RedHat 9 box.

The FS i/o classes in KinoSearch use a FILE* and
fopen/fwrite/fread/fseek/ftell, rather than file descriptors and the POSIX
family of functions. Theoretically, this is wasteful because FILE* stream i/o
is buffered, so there's double buffering happening. I've meant to change that
for some time. However, when I've used setvbuf(self->fhandle, NULL, _IONBF, 0)
to eliminate the buffer for the underlying FILE* object, performance tanks:
indexing time doubles. I still don't understand exactly why, but I know a
little more now.

  • Swapping out the FILE* for a descriptor and switching all the I/O calls to
    POSIX variants has no measurable impact on any of these systems.

  • Changing the KS buffer size from 1024 to 4096 has no measurable impact on
    any of these systems.

  • Using setvbuf to eliminate the buffering at output turns out to have no
    impact on indexing performance. It's only killing off the read mode FILE*
    buffer that causes the problem.

So, it seems that the only change I can make moves the numbers in the wrong
direction.

The results are somewhat puzzling because I would ordinarily have blamed
sub-optimal flush/refill scheduling in my app for the degraded performance
with setvbuf() on read mode. However, the POSIX i/o calls are unbuffered, so
that's not it. My best guess is that disabling buffering for read mode
disables an fseek/ftell optimization.


asfimport commented May 26, 2007

Michael McCandless (@mikemccand) (migrated from JIRA)

Marvin, it's odd that you see no gains when coming straight from C.
Hmmm.

I wonder if IO calls from Java somehow have a sizable overhead that
the corresponding C calls do not. I didn't think JNI was that
expensive? Though, it looks like reading into byte[] does entail
extra byte copying. We could explore using ByteBuffers from
java.nio.* ... ahh, this has been somewhat explored already, at least
in #1492 and #1968.

Also, how much "merging" is actually done in your test / KS? How many
segments are merged at once? The fact that KS doesn't re-merge the
stored fields & term vectors (within one session) is probably also a
big difference here.


Marvin Humphrey (migrated from JIRA)

> Also, how much "merging" is actually done in your test / KS? How many
> segments are merged at once? The fact that KS doesn't re-merge the stored
> fields & term vectors (within one session) is probably also a big difference
> here.

In the previous test, I was indexing 1000 Reuters articles all in one session.
The postings were never flushed to disk prior to the final write, as the
external sorter never exceeded its memory threshold.

I reran the test on the FreeBSD box, changing it so that the index was built
up in 10-doc increments. There was still no measurable change for changing
either the KS buffer size or POSIX/stream style IO. However, using setvbuf on
the read mode FILE* stream i/o was utterly disastrous. After several minutes
(compare with 3.8 seconds) I cut it off. Subsequent testing with a 500-doc
increment verified that the test actually was working (10.3 secs).


Michael McCandless (@mikemccand) (migrated from JIRA)

I re-ran the "second test" above, but this time with compound file
turned off.

Baseline (trunk) = 1 K buffers for all 3. New = 16 K for
BufferedIndexOutput, 16 K for CompoundFileWriter and 4 K for
BufferedIndexInput. I ran each test 4 times & took the best time.

Quad core Mac OS X on 4-drive RAID 0
baseline 553 sec
new 499 sec
-> 10% faster

Dual core Debian Linux (2.6.18 kernel) on 6 drive RAID 5
baseline 590 sec
new 548 sec
-> 7% faster

Windows XP Pro laptop, single drive
baseline 1078 sec
new 918 sec
-> 15% faster

Quick observations:

  • Still a healthy 7-15% overall gain even without compound file by
    increasing the buffer sizes.

  • The overall performance gain on the trunk just by turning off
    compound file ranges from 7-33% (33% gain = the Windows XP
    Laptop).

OK I plan to commit this soon.
