Use NIO positional read to avoid synchronization in FSIndexInput [LUCENE-753] #1828
Comments
Yonik Seeley (@yonik) (migrated from JIRA) Patch for FSIndexInput to use a positional read call that doesn't require explicit synchronization. Note that the implementation of that read call may still involve some synchronization depending on the JVM and OS (notably Windows, which lacks a native pread, AFAIK).
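For illustration, a minimal sketch (not the actual patch) of the difference: the classic path must lock around seek-then-read because the file pointer is shared state, while `FileChannel`'s positional read takes the offset as a call argument and leaves no shared cursor to protect.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

class PositionalReadDemo {
    // Classic approach: seek() mutates the shared file pointer, so all
    // readers of one descriptor must serialize around seek-then-read.
    static int classicRead(RandomAccessFile raf, byte[] b, long pos) throws IOException {
        synchronized (raf) {
            raf.seek(pos);
            return raf.read(b, 0, b.length);
        }
    }

    // pread-style approach: the position is an argument to the call, so
    // no shared cursor is touched and no lock is needed in user code.
    static int positionalRead(FileChannel ch, byte[] b, long pos) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(b);
        int total = 0;
        while (buf.hasRemaining()) {
            int n = ch.read(buf, pos + total); // pread(2) on Unix; may still lock internally on Windows
            if (n < 0) break; // EOF
            total += n;
        }
        return total;
    }
}
```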
Yonik Seeley (@yonik) (migrated from JIRA) This change should be faster on heavily loaded multi-threaded servers using the non-compound index format. Compound index format (CSIndexInput) still does synchronization because the base IndexInput is not cloned (and hence shared by all CSIndexInput clones). It's unclear if getting rid of the synchronization is worth the cloning overhead in this case.
Doug Cutting (@cutting) (migrated from JIRA) This patch continues to use BufferedIndexInput and allocates a new ByteBuffer for each call to read(). I wonder if it might be more efficient to instead directly extend IndexInput and always represent the buffer as a ByteBuffer?
Yonik Seeley (@yonik) (migrated from JIRA) CSIndexInput synchronization could also be eliminated if a pread were added to IndexInput: `public abstract void readBytes(byte[] b, int offset, int len, long fileposition)`. Unfortunately, that would break any custom Directory-based implementations out there, and we can't provide a suitable default with seek & read because we don't know what object to synchronize on.
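A hypothetical sketch of that proposal, shown against a trimmed-down IndexInput (the positional overload never shipped in this form); the comment spells out why no default implementation can be written in the base class:

```java
import java.io.IOException;

// Trimmed-down IndexInput, for illustration only.
abstract class IndexInputSketch {
    public abstract void readBytes(byte[] b, int offset, int len) throws IOException;
    public abstract void seek(long pos) throws IOException;

    // Proposed pread-style overload. A default of
    //   synchronized (???) { seek(filePosition); readBytes(b, offset, len); }
    // cannot be written here: the base class has no way to know which object
    // guards the underlying file, so every custom Directory implementation
    // would be forced to override this -- the compatibility break described above.
    public abstract void readBytes(byte[] b, int offset, int len, long filePosition)
            throws IOException;
}
```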
Yonik Seeley (@yonik) (migrated from JIRA) Here is a patch that directly extends IndexInput to make things a little easier. ByteBuffer does have offset, length, etc., but I did not use them because BufferedIndexInput currently allocates the byte[] on demand, and thus would add additional checks to readByte(). Also, the NIO Buffer.get() isn't as efficient as our own array access.
Bogdan Ghidireac (migrated from JIRA) You can find a NIO variation of IndexInput attached to this issue: #1597 I had good results on multiprocessor machines under heavy load. Regards,
Yonik Seeley (@yonik) (migrated from JIRA) Thanks for the pointer Bogdan, it's interesting that you use transferTo instead of read... is there any advantage to this? You still need to create a new object every read(), but at least it looks like a smaller object. It's also been pointed out to me that #1492 has some more NIO code.
Bogdan Ghidireac (migrated from JIRA) The Javadoc says that transferTo can be more efficient because the OS can transfer bytes directly from the filesystem cache to the target channel without actually copying them.
Yonik Seeley (@yonik) (migrated from JIRA)

> The Javadoc says that transferTo can be more efficient because the OS can transfer bytes

Unfortunately, only for DirectByteBuffers and other FileChannels, not for HeapByteBuffers.
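For reference, a sketch of what a transferTo-based read can look like: the file channel pushes bytes into a small WritableByteChannel adapter that copies them into the caller's array (class and method names here are illustrative, not Bogdan's actual code). Per the caveat above, with a heap-array sink like this the zero-copy fast path does not apply.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

// Adapter that receives transferred bytes into a plain byte[].
final class ByteArraySink implements WritableByteChannel {
    private final byte[] dest;
    private int written;

    ByteArraySink(byte[] dest) { this.dest = dest; }

    public int write(ByteBuffer src) {
        int n = Math.min(src.remaining(), dest.length - written);
        src.get(dest, written, n); // copy out of the (possibly direct) buffer
        written += n;
        return n;
    }
    public boolean isOpen() { return true; }
    public void close() {}

    // Positional like pread: transferTo never touches the channel's cursor.
    static int readVia(FileChannel ch, long pos, byte[] b) throws IOException {
        return (int) ch.transferTo(pos, b.length, new ByteArraySink(b));
    }
}
```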
Yonik Seeley (@yonik) (migrated from JIRA) Attaching a test that reads a file in different ways, either random access or serially, from a number of threads.
Yonik Seeley (@yonik) (migrated from JIRA) Single-threaded random access performance of a fully cached 64MB file on my home PC (WinXP), Java 6:

```
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=6518936
config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=6518936
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 filelen=6518936
config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=1024 filelen=6518936
```
Brian Pinkerton (migrated from JIRA) Most of my workloads would benefit by removing the synchronization in FSIndexInput, so I took a closer look at this issue. I found exactly the opposite results from Yonik's on two platforms that I use frequently in production (Solaris and Linux), and by a significant margin. I even get the same behavior on the Mac, though I'm not running Java 6 there.

```
config: impl=ChannelPread serial=false nThreads=200 iterations=10 bufsize=1024 filelen=10485760
config: impl=ClassicFile serial=false nThreads=200 iterations=10 bufsize=1024 filelen=10485760
```
Yonik Seeley (@yonik) (migrated from JIRA) Brad, one possible difference is the number of threads we tested with. A normal production system shouldn't see 200 concurrently running search threads unless it's just about to fall over, or unless it's one of those massive multi-core systems. After you pass a certain amount of parallelism, NIO can help.
Brian Pinkerton (migrated from JIRA) Whoops; I should have paid more attention to the args. The results in the single-threaded case still favor pread, but by a slimmer margin:

```
Linux:
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=10485760
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 filelen=10485760

Solaris 10:
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=10485760
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 filelen=10485760

Mac OS X:
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 filelen=10485760
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=10485760
```
Doug Cutting (@cutting) (migrated from JIRA) > Brad, [...] That's Brian. And right, the difference in your tests is the number of threads. Perhaps this is a case where one size will not fit all. MmapDirectory is fastest on 64-bit platforms with lots of threads, while good-old-FSDirectory has always been fastest for single-threaded access. Perhaps a PreadDirectory would be the Directory of choice for multi-threaded access of large indexes on 32-bit hardware? It would be useful to benchmark this patch against MmapDirectory, since they both remove synchronization.
Doug Cutting (@cutting) (migrated from JIRA) My prior remarks were posted before I saw Brian's latest benchmarks. While it would still be good to throw mmap into the mix, pread now looks like a strong contender for the one that might beat all. It works well on 32-bit hardware, it's unsynchronized, and it's fast. What's not to like?
Yonik Seeley (@yonik) (migrated from JIRA) Weird... I'm still getting slower results from pread on WinXP:

```
Yonik@spidey ~
$ c:/opt/jdk16/bin/java -server FileReadTest testfile ClassicFile false 1 200
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=9616000
answer=160759732, ms=14984, MB/sec=128.35024025627337
$ c:/opt/jdk16/bin/java -server FileReadTest testfile ClassicFile false 1 200
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=1024 filelen=9616000
answer=160759732, ms=14640, MB/sec=131.36612021857923
$ c:/opt/jdk16/bin/java -server FileReadTest testfile ChannelPread false 1 200
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 filelen=9616000
answer=160759732, ms=21766, MB/sec=88.35798952494717
$ c:/opt/jdk16/bin/java -server FileReadTest testfile ChannelPread false 1 200
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=1024 filelen=9616000
answer=160759732, ms=21718, MB/sec=88.55327378211622
$ c:/opt/jdk16/bin/java -version
java version "1.6.0_02"
Java(TM) SE Runtime Environment (build 1.6.0_02-b06)
Java HotSpot(TM) Client VM (build 1.6.0_02-b06, mixed mode)
```
robert engels (migrated from JIRA) I sent this via email, but I probably need to add it to the thread... I posted a bug on this to Sun a long while back. This is a KNOWN BUG on Windows. NIO preads actually sync behind the scenes on some platforms. Using multiple file descriptors is much faster. See bug http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734
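A sketch of the multiple-descriptors workaround Robert alludes to: a small round-robin pool of RandomAccessFiles per underlying file, so concurrent readers rarely contend on the same descriptor. The pool shape and names are illustrative, not from any attached patch.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.atomic.AtomicInteger;

final class DescriptorPool {
    private final RandomAccessFile[] files;
    private final AtomicInteger next = new AtomicInteger();

    DescriptorPool(File path, int size) throws IOException {
        files = new RandomAccessFile[size];
        for (int i = 0; i < size; i++) {
            files[i] = new RandomAccessFile(path, "r"); // one descriptor each
        }
    }

    // Round-robin hand-out: each caller locks only the descriptor it gets,
    // so contention falls roughly in proportion to the pool size.
    RandomAccessFile acquire() {
        int i = (next.getAndIncrement() & 0x7fffffff) % files.length;
        return files[i];
    }
}
```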
Doug Cutting (@cutting) (migrated from JIRA) So it looks like pread is ~50% slower on Windows, and ~5-25% faster on other platforms. Is that enough of a difference that it might be worth having FSDirectory use different implementations of FSIndexInput based on the value of Constants.WINDOWS (and perhaps JAVA_VERSION)?
Michael McCandless (@mikemccand) (migrated from JIRA) +1. I think having good out-of-the-box defaults is extremely important (most users won't tune), and given the substantial cross-platform differences here I think we should conditionalize the defaults according to the platform.
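A sketch of that conditional default: the OS test mirrors what Lucene's Constants.WINDOWS checks, while the factory methods below are hypothetical stand-ins for the classic and pread-based FSIndexInput implementations discussed above.

```java
import java.io.File;
import java.io.IOException;

final class FSInputFactory {
    interface Input { /* stand-in for IndexInput */ }

    // Hypothetical constructors for the two variants discussed above.
    static Input classicInput(File f) throws IOException {
        return new Input() {}; // RandomAccessFile-based, synchronized reads
    }
    static Input preadInput(File f) throws IOException {
        return new Input() {}; // FileChannel.read(buf, pos)-based, lock-free
    }

    static Input open(File f) throws IOException {
        // Mirrors Constants.WINDOWS: FileChannel's positional read takes an
        // internal lock on Windows (Sun bug 6265734), so classic IO stays the
        // default there; elsewhere pread avoids the per-file lock.
        boolean windows = System.getProperty("os.name").startsWith("Windows");
        return windows ? classicInput(f) : preadInput(f);
    }
}
```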
robert engels (migrated from JIRA) As an aside, if the Lucene people voted on the Java bug (and/or sent emails via the proper channels), hopefully the underlying bug can be fixed in the JVM. In my opinion it is a serious one - it ruins any performance gains of using NIO on files.
Yonik Seeley (@yonik) (migrated from JIRA) Updated test that fixes some thread synchronization issues to ensure that the "answer" is the same for all methods. Brian, in some of your tests the answer is "0"... is this because your test file consists of zeros (created via /dev/zero or equiv)? UNIX systems treat blocks of zeros differently than normal files (they are stored as holes). It shouldn't make too much of a difference in this case, but just to be sure, could you try with a real file?
Brian Pinkerton (migrated from JIRA) Yeah, the file was full of zeroes. But I created the files w/o holes and was using filesystems that don't compress file contents. Just to be sure, though, I repeated the tests with a file with random contents; the results above still hold.
Brian Pinkerton (migrated from JIRA) BTW, I think the performance win with Yonik's patch for some workloads could be far greater than what the simple benchmark illustrates. Sure, pread might be marginally faster. But the real win is avoiding synchronized access to the file. I did some IO tracing a while back on one particular workload that is characterized by: [...]

In this workload, where each query hits each compound index, the locking in FSIndexInput means that a single rare query clobbers the response time for all queries. The requests to read cached data are serialized (fairly, even) with those that hit the disk. While we can't get rid of the rare queries, we can allow the common ones to proceed against cached data right away.
Michael McCandless (@mikemccand) (migrated from JIRA) I ran Yonik's most recent FileReadTest.java on the platforms below. I tested two Windows XP Pro machines and got opposite results from [...]. I'm showing ChannelTransfer to be much faster on all platforms except [...]. The ChannelTransfer test is giving the wrong checksum, but I think [...].

```
Mac OS X 10.4 (Sun java 1.5)
config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864

Linux 2.6.22.1 (Sun java 1.5)
config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864

Windows Server 2003 R2 Enterprise x64 (Sun java 1.6)
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864

Windows XP Pro SP2, laptop (Sun Java 1.5)
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864

Windows XP Pro SP2, older desktop (Sun Java 1.6)
config: impl=ClassicFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
config: impl=ChannelFile serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
config: impl=ChannelPread serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
config: impl=ChannelTransfer serial=false nThreads=1 iterations=200 bufsize=6518936 filelen=67108864
```
Michael McCandless (@mikemccand) (migrated from JIRA) I also just ran a test with 4 threads, random access, on Linux 2.6.22.1:

```
config: impl=ClassicFile serial=false nThreads=4 iterations=200 bufsize=6518936 filelen=67108864
config: impl=ChannelFile serial=false nThreads=4 iterations=200 bufsize=6518936 filelen=67108864
config: impl=ChannelPread serial=false nThreads=4 iterations=200 bufsize=6518936 filelen=67108864
config: impl=ChannelTransfer serial=false nThreads=4 iterations=200 bufsize=6518936 filelen=67108864
```

ChannelTransfer got even faster (it scales up better with added threads).
Yonik Seeley (@yonik) (migrated from JIRA) Mike, it looks like you are running with a bufsize of 6.5MB!
Yonik Seeley (@yonik) (migrated from JIRA) I'll try fixing the transferTo test before anyone re-runs any tests.
Michael McCandless (@mikemccand) (migrated from JIRA) Doh!! Woops :) I will rerun...
Yonik Seeley (@yonik) (migrated from JIRA) OK, uploading the latest version of the test that should fix ChannelTransfer (it's also slightly optimized to not create a new object per call). Well, at least we've learned that printing out all the input params for benchmarking programs is good practice :-)
Michael McCandless (@mikemccand) (migrated from JIRA) That would be awesome, Jason. I think we should then commit NIOFSDirectory to core as at least a way around this bottleneck on all platforms but Windows. Maybe we can do this in time for 2.4?
Matthew Mastracci (migrated from JIRA) Michael, I ran this new patch against our big index and it works very well. If I have time, I'll run some benchmarks to see what our real-life performance improvements are like. Note that I'm only running it for our read-only snapshot of the index, however, so this hasn't been tested for writing to a large index.
Michael McCandless (@mikemccand) (migrated from JIRA) Super, thanks! This was the same index that would reliably hit the exception above?
Matthew Mastracci (migrated from JIRA) Correct - it would hit the exception every time at startup. I've been running NIOFSDirectory for the last couple of hours with zero exceptions (except for running out of file descriptors after starting it the first time :)). The previous incarnation was running MMapDirectory. Thanks for all the work on this patch.
Matthew Mastracci (migrated from JIRA) This exception popped up out of the blue a few hours in. No exceptions before it. I'll see if I can figure out whether it was caused by our index snapshotting or if it's a bug elsewhere in NIOFSDirectory. I haven't seen any exceptions like this with MMapDirectory, but it's possible there's something that we're doing that isn't correct.
Michael McCandless (@mikemccand) (migrated from JIRA) Interesting... Are you really sure you're not accidentally closing the searcher before calling Searcher.docFreqs? Are you calling docFreqs directly from your app? It looks like MMapIndexInput.close() is a no-op, so it would not have detected calling Searcher.docFreqs after close, whereas NIOFSDirectory (and the normal FSDirectory) will. If you try the normal FSDirectory, do you also see an exception like this? Incidentally, what sort of performance differences are you noticing between these three different ways of accessing an index in the file system?
Yonik Seeley (@yonik) (migrated from JIRA) +1, the latest patch is looking good to me! Also, our finalizers aren't technically thread-safe, which could lead to a double close in the finalizer (although I doubt this particular case would ever happen). If we need to keep them, we could change Descriptor.isOpen to volatile; there should be pretty much no cost since it's only checked in close().
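A sketch of that suggestion applied to the Descriptor wrapper (the shape follows the private Descriptor class inside FSIndexInput; details here are illustrative): the volatile flag makes an explicit close() visible to the finalizer thread, avoiding a second close of the same descriptor in practice.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

final class Descriptor extends RandomAccessFile {
    // volatile: a close() on one thread is visible to the finalizer thread.
    // The check-then-act below is still not atomic; the bet, per the
    // discussion above, is that genuinely racing closes won't happen.
    private volatile boolean isOpen;

    Descriptor(File file, String mode) throws IOException {
        super(file, mode);
        isOpen = true;
    }

    @Override
    public void close() throws IOException {
        if (isOpen) {
            isOpen = false;
            super.close();
        }
    }

    @Override
    protected void finalize() throws Throwable {
        try {
            close(); // safety net if the input was never closed explicitly
        } finally {
            super.finalize();
        }
    }
}
```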
Michael McCandless (@mikemccand) (migrated from JIRA) Yonik, do you mean BufferedIndexInput.clone (not FSIndexInput)? I think once we fix NIOFSIndexInput to subclass BufferedIndexInput, cloning should be lazy again. Jason, are you working on this (subclassing BufferedIndexInput)? If not, I can take it.

Hmmm... I'll update both FSDirectory's and NIOFSDirectory's isOpen to be volatile.
Jason Rutherglen (migrated from JIRA) Mike, I have not started on the subclassing from BufferedIndexInput yet. I can work on it Monday, though.
Michael McCandless (@mikemccand) (migrated from JIRA) OK, thanks!
Michael McCandless (@mikemccand) (migrated from JIRA) Updated patch with Yonik's volatile suggestion - thanks Yonik! Also, I removed NIOFSDirectory.createOutput since it was doing the same thing as super().
Matthew Mastracci (migrated from JIRA) Michael,

Our IndexReaders are actually managed in a shared pool (currently 8 IndexReaders, shared round-robin style as requests come in). We have some custom reference-counting logic that's supposed to keep the readers alive as long as somebody has them open. As new index snapshots come in, the IndexReaders are re-opened, and reference counts ensure that any old index readers in use are kept alive until the searchers are done with them. I'm guessing we have an error in our reference-counting logic that just doesn't show up under MMapDirectory (as you mentioned, close() is a no-op).

We're calling docFreqs directly from our app. I'm guessing that it just happens to be the most likely item to be called after we roll to a new index snapshot.

I don't have hard performance numbers right now, but we were having a hard time saturating I/O or CPU with FSDirectory. The locking was basically killing us. When we switched to MMapDirectory and turned on compound files, our performance jumped at least 2x. The preliminary results I'm seeing with NIOFSDirectory seem to indicate that it's slightly faster than MMapDirectory.

I'll try setting our app back to using the old FSDirectory and see if the exceptions still occur. I'll also try to fiddle with our unit tests to make sure we're correctly ref-counting all of our index readers.

BTW, I ran a quick FSDirectory/MMapDirectory/NIOFSDirectory shootout. It uses a parallel benchmark that roughly models our real-life workload. I ran the benchmark once through to warm the disk cache, then got the following. The numbers are fairly stable across various runs once the disk caches are warm:

```
FS: 33644ms
```

I'm a bit surprised at the results myself, but I've spent a bit of time tuning the indexes to maximize concurrency. I'll double-check that the benchmark is correctly running all of the tests. The benchmark effectively runs 10-20 queries in parallel at a time, then waits for all queries to complete. It does this end-to-end for a number of different query batches, then totals up the time to complete each batch.
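For illustration, a minimal sketch of the kind of reference counting Matthew describes (names hypothetical, along the lines of the incRef/decRef pattern Lucene's IndexReader later exposed): the pool holds one reference, each in-flight searcher takes another, and the underlying reader is closed exactly once, when the count reaches zero.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class RefCountedReader {
    private final AtomicInteger refs = new AtomicInteger(1); // pool's own reference

    // A searcher takes a reference before using the reader; the CAS loop
    // prevents resurrecting a reader whose count already hit zero.
    void incRef() {
        int cur;
        do {
            cur = refs.get();
            if (cur <= 0) throw new IllegalStateException("reader is closed");
        } while (!refs.compareAndSet(cur, cur + 1));
    }

    // Releasing the last reference closes the underlying reader exactly once.
    void decRef() {
        if (refs.decrementAndGet() == 0) {
            closeUnderlyingReader();
        }
    }

    private void closeUnderlyingReader() {
        // IndexReader.close() would go here.
    }
}
```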
Jason Rutherglen (migrated from JIRA) LUCENE-753.patch: NIOFSIndexInput now extends BufferedIndexInput. I was unable to test, however, and wanted to just get this up.
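A condensed sketch of that shape (simplified and hypothetical in its details; see the actual LUCENE-753.patch): BufferedIndexInput subclasses supply readInternal/seekInternal, and the NIO version fills the buffer with a positional channel read, so clones sharing one FileChannel never contend on a shared file pointer.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Stand-in for BufferedIndexInput's extension points.
abstract class BufferedInputSketch {
    protected long position; // absolute file offset of the next refill
    protected abstract void readInternal(byte[] b, int offset, int length) throws IOException;
    protected void seekInternal(long pos) { position = pos; }
}

final class NIOInputSketch extends BufferedInputSketch {
    private final FileChannel channel; // shared by clones; never seek()ed

    NIOInputSketch(FileChannel channel) { this.channel = channel; }

    @Override
    protected void readInternal(byte[] b, int offset, int length) throws IOException {
        ByteBuffer bb = ByteBuffer.wrap(b, offset, length);
        long pos = position;
        while (bb.hasRemaining()) {
            int n = channel.read(bb, pos); // pread: position passed per call
            if (n < 0) throw new IOException("read past EOF");
            pos += n;
        }
        position = pos;
    }
}
```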
Michael McCandless (@mikemccand) (migrated from JIRA) This is surprising - your benchmark is very concurrent, yet FSDir and NIOFSDir are close to the same net throughput, while MMapDir is quite a bit faster. Is this on a non-Windows OS?
Michael McCandless (@mikemccand) (migrated from JIRA) New patch attached. Matthew, if you could try this version out on your [...]

I didn't like how we were still copying the hairy readBytes & refill [...] But then I realized we were duplicating a lot of code from [...]

Some other things also fixed: [...]

To test this, I made NIOFSDirectory the default IMPL in [...] I also built the first 150K docs of wikipedia and ran various searches [...]

The class is quite a bit simpler now, however there's one thing I [...]
Michael McCandless (@mikemccand) (migrated from JIRA) New version attached. This one re-uses a wrapped byte buffer even when it's CSIndexInput that's calling it. I plan to commit in a day or two.
Michael McCandless (@mikemccand) (migrated from JIRA) I just committed revision 690539, adding NIOFSDirectory. I will leave this open, but move off of 2.4, until we can get similar performance gains on Windows...
robert engels (migrated from JIRA) Sun is accepting outside bug fixes to the OpenJDK, and merging them into the commercial JDK (in most cases). If the underlying bug is fixed in the Windows JDK - not too hard - then you can fix this properly in Lucene. If you don't fix it in the JDK, you are always going to have the 'running out of file handles' synchronization vs. the 'locked position' synchronization - there is no way to fix this in user code...
Yonik Seeley (@yonik) (migrated from JIRA) Attaching new FileReadTest.java that fixes a concurrency bug in SeparateFile - each reader needed its own file position.
Uwe Schindler (@uschindler) (migrated from JIRA) This issue was resolved a long time ago, but was left open for the stupid Windows Sun JRE bug, which was never resolved. With Lucene 3.x and trunk we have better defaults (e.g. MMapDirectory on 64-bit Windows). Users should default to FSDirectory.open() and use the returned directory for best performance.
As suggested by Doug, we could use NIO pread to avoid synchronization on the underlying file.
This could mitigate any MT performance drop caused by reducing the number of files in the index format.
Migrated from LUCENE-753 by Yonik Seeley (@yonik), 5 votes, resolved Jan 25 2011
Attachments: FileReadTest.java (versions: 8), FSDirectoryPool.patch, FSIndexInput.patch (versions: 2), lucene-753.patch (versions: 2), LUCENE-753.patch (versions: 5)