-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Should we fold DirectIODirectory into FSDirectory? #13194
Comments
We can fold it in, but make it optional. Mike's investigation is fine but not the only truth. Circumventing OS cache is not always a good idea and you should only carefully do it for some use cases, not by default. If Elasticsearch or Solr wants to do this, ok. But lower level lucene should have all options. I would like it more to maybe move the directory to Lucene core and make the FSDirecory#open factory use it if it is supported by filesystem and JVM (the reflection hack!). Not sure how to detect if it's supported by filesystem without actually opening a file. 🤨 Finally, my personal opinion is to keep everything as is. We also do not fold NRTCachingDirectory into FSDirectory? |
In my opinion, we can add a separate series of factory methods to FSDirectory to also create NRTCaching wrappers. This would make it easier for beginners. |
If you have enough memory on your machine, not using directio is still recommended. If you are indexing and reading index at same time and you merge segments, why do you want the merge option be not cached at all? If you write a new file why should it not be in FS cache from beginning? If you query the new segment it's cold...! On the other hand, why should the merger not make use of already cached segments? The problem that directio tries to solve is the issue of not allowing fadvise/madvise when opening index files. I am planning to have some optional panama foreign in Mmapdir in future, so we can based on IOcontext give some hints to kernel. There's only one issue open that I am discussing with Maurizio and Alan Bateman: make the file descriptor (long) available by the nio/Path/Socket APIs in Java like you can get the PID from Process AP I- to allow to pass it to native code. Currently it is hidden in private internals Directio makes sense when you only index but not query at same time. |
That works for me. I'm not worried about Elastisearch and Solr, they have people who know all the Lucene hidden features, I was more thinking of people who use Lucene directly who may not know about all our semi-hidden features. Now that you mentioned
These are trade-offs indeed. The bias I have is that if you ask me if it's better for Lucene to optimize for the case when the whole index fits into RAM vs. the index is larger than the page cache, I'd probably prefer to optimize for making Lucene better at handling large indexes out-of-the-box. Like you said fadvise/madvise are likely more appropriate than direct I/O for merging.
This would be great! |
++ Yes, and we can set that always when opening the file containing vector data.
Lemme know if I can help move this forward. |
Everybody can help. Basically it would be good to be able to get the LONG file handle (windows) or File Descriptor from a FileChannel/FileInputStream/FileOutputStream in a similar way like you can get the process ID (PID) from the ProcessBuilder API. There is no security risk in it, as you cannot do any harm with it - like you cannot do any harm with the process id of a child process returned by ProcessBuilder from Java code only. It also contains no sensitive data. Initially this was not clear (there's an issue open), but in Brussels Alan Bateman and Mauricio agreed to add such an API. I think we need to push it a bit. You can help. |
Once we have that we can add some code to lookup the MethodHandle for the madvise / fadvise from Panama and then call it with the file handle/descriptor returned by the native API. This is all optional, we could do it similar to vector API: Only get the MethodHandle if native access is enabled with command line flags, otherwise madvise/fadvise is a noop. We could also think of adding |
I think the Direct-IO solution is hacky and trappy myself. This is a really slow way to do perform IO, with the hope? that it will somehow make some separate NRT search process faster?
|
@ChrisHegarty: For MMapDirectory there's another way for madvise to work (at least for reading index files): When the user has enabled native Panama access, we can also mmap our MemorySegments with our own code without using FileChannel, because we have to deal with file descriptor anyways and native access was already enabled. In fact we close the FileChannel after mapping anyways, so the FileChannel is completely useless for MMapDirectory code. We would need code for Posix and Windows. For writing files it is more complicated, so setting fadvise there also involves replicating the whole (File-)OutputStream used for writing index files. |
Yes, that is exactly what I've been thinking too. We're leaning more and more into FFI with Elasticsearch and it is proving to both behave and perform very well. Abstracting the functionality at a sightly higher-level may allow to have a Posix implementation and a fallback to an implementation based FileChannel. The latter would cover Windows - possibly ignoring the "hints" for random access. This could be a start, allowing to add more platform support as needed.
Ugh!! not great. We need either the memory address or the file descriptor. Not to mention that we map in 16GB chunks, so may require a little more plumbing (but nothing too hard). Is the mapping in chunks still needed? Did you @uschindler encounter issues when mapping larger chunks? |
Yes, my tests on Windows showed havoc with large files when randomly mapping unmapping them. If you mmap a 1 Terabyte file and another one, then unmap one of them and mmap a slightly larger one -> running out of address space. On Linux its the same, but address space is larger (kernel has different mapping). But this is not a problem form MemorySegmentIndexInputProvider, the mmap call is easy to do, the splicing is already there. We will just need two implementations for the filename -> MemorySegment[]. Easy, no changes needed. |
I just noticed, for madvise no file descriptor is needed. We can implement that straight away using the memory segments. So fadvise is the bigger problem needed for writing. |
++ I think we have all we need enable this for reading - through a "hints" style API in IOContext. I can take a look at it, but probably not until next week. I'm sure you already know this, but here's how we've been abstracting FFI in Elasticsearch, https://github.com/elastic/elasticsearch/tree/main/libs/native/src/main/java/org/elasticsearch/nativeaccess . While the use case is not the same here, the necessary abstractions and framework are not unalike. |
Thanks, this is great. For the use-case here, I think abstractions are not needed, so we can implement it in our MR-JAR:
I will hack a draft PR with the basic setup later (only in main branch with Java 21), exposing madvise() and mlock2() ase examples and then we can do tests on Linux (maybe also macos) by passing different options depending on IOContext. P.S.: The lucene/lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInputProvider.java Line 33 in d5f8853
|
I created a draft PR for IOContext#readOnce: #13196 |
Closing this issue as won't fix, I imagine that madvise/fadvise is considered a superior solution than direct I/O. |
Description
DirectIODirectory
is appealing on paper by limiting the churn that merges incur on the page cache, thus helping searches have a better hit ratio on this cache. @mikemccand has a nice blog about this feature.Back in the past when it was called
NativeUnixDirectory
, it used to rely on native code, which made inclusion in lucene/core a no-go. But nowadays it looks more reasonable, so should we try to fold its logic intoFSDirectory
?There is still a reflection trick, I wonder how bad it is. Opinions @uschindler @ChrisHegarty ?
The text was updated successfully, but these errors were encountered: