Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should we fold DirectIODirectory into FSDirectory? #13194

Closed
jpountz opened this issue Mar 20, 2024 · 16 comments
Closed

Should we fold DirectIODirectory into FSDirectory? #13194

jpountz opened this issue Mar 20, 2024 · 16 comments

Comments

@jpountz
Copy link
Contributor

jpountz commented Mar 20, 2024

Description

DirectIODirectory is appealing on paper by limiting the churn that merges incur on the page cache, thus helping searches have a better hit ratio on this cache. @mikemccand has a nice blog about this feature.

Back in the past when it was called NativeUnixDirectory, it used to rely on native code, which made inclusion in lucene/core a no-go. But nowadays it looks more reasonable, so should we try to fold its logic into FSDirectory?

There is still a reflection trick, I wonder how bad it is. Opinions @uschindler @ChrisHegarty ?

@uschindler
Copy link
Contributor

uschindler commented Mar 20, 2024

We can fold it in, but make it optional. Mike's investigation is fine but not the only truth. Circumventing OS cache is not always a good idea and you should only carefully do it for some use cases, not by default. If Elasticsearch or Solr wants to do this, ok. But lower level lucene should have all options.

I would like it more to maybe move the directory to Lucene core and make the FSDirecory#open factory use it if it is supported by filesystem and JVM (the reflection hack!). Not sure how to detect if it's supported by filesystem without actually opening a file. 🤨

Finally, my personal opinion is to keep everything as is. We also do not fold NRTCachingDirectory into FSDirectory?

@uschindler
Copy link
Contributor

In my opinion, we can add a separate series of factory methods to FSDirectory to also create NRTCaching wrappers. This would make it easier for beginners.

@uschindler
Copy link
Contributor

If you have enough memory on your machine, not using directio is still recommended.

If you are indexing and reading index at same time and you merge segments, why do you want the merge option be not cached at all? If you write a new file why should it not be in FS cache from beginning? If you query the new segment it's cold...! On the other hand, why should the merger not make use of already cached segments?

The problem that directio tries to solve is the issue of not allowing fadvise/madvise when opening index files. I am planning to have some optional panama foreign in Mmapdir in future, so we can based on IOcontext give some hints to kernel. There's only one issue open that I am discussing with Maurizio and Alan Bateman: make the file descriptor (long) available by the nio/Path/Socket APIs in Java like you can get the PID from Process AP I- to allow to pass it to native code. Currently it is hidden in private internals

Directio makes sense when you only index but not query at same time.

@jpountz
Copy link
Contributor Author

jpountz commented Mar 20, 2024

If Elasticsearch or Solr wants to do this, ok. But lower level lucene should have all options.

That works for me. I'm not worried about Elastisearch and Solr, they have people who know all the Lucene hidden features, I was more thinking of people who use Lucene directly who may not know about all our semi-hidden features. Now that you mentioned NRTCachingDirectory, HardlinkCopyDirectoryWrapper comes to mind as well as another useful optimization that may be worth folding into FSDirectory. If enabling some of these things by default is controversial, making them easier to use still sounds like a good step forward.

If you query the new segment it's cold...! On the other hand, why should the merger not make use of already cached segments?

These are trade-offs indeed. The bias I have is that if you ask me if it's better for Lucene to optimize for the case when the whole index fits into RAM vs. the index is larger than the page cache, I'd probably prefer to optimize for making Lucene better at handling large indexes out-of-the-box. Like you said fadvise/madvise are likely more appropriate than direct I/O for merging.

I am planning to have some optional panama foreign in Mmapdir in future, so we can based on IOcontext give some hints to kernel.

This would be great!

@ChrisHegarty
Copy link
Contributor

The problem that directio tries to solve is the issue of not allowing fadvise/madvise when opening index files. I am planning to have some optional panama foreign in Mmapdir in future, so we can based on IOcontext give some hints to kernel.

++ Yes, and we can set that always when opening the file containing vector data.

There's only one issue open that I am discussing with Maurizio and Alan Bateman: make the file descriptor (long) available by the nio/Path/Socket APIs in Java like you can get the PID from Process AP I- to allow to pass it to native code.

Lemme know if I can help move this forward.

@uschindler
Copy link
Contributor

The problem that directio tries to solve is the issue of not allowing fadvise/madvise when opening index files. I am planning to have some optional panama foreign in Mmapdir in future, so we can based on IOcontext give some hints to kernel.

++ Yes, and we can set that always when opening the file containing vector data.

There's only one issue open that I am discussing with Maurizio and Alan Bateman: make the file descriptor (long) available by the nio/Path/Socket APIs in Java like you can get the PID from Process AP I- to allow to pass it to native code.

Lemme know if I can help move this forward.

Everybody can help. Basically it would be good to be able to get the LONG file handle (windows) or File Descriptor from a FileChannel/FileInputStream/FileOutputStream in a similar way like you can get the process ID (PID) from the ProcessBuilder API. There is no security risk in it, as you cannot do any harm with it - like you cannot do any harm with the process id of a child process returned by ProcessBuilder from Java code only. It also contains no sensitive data.

Initially this was not clear (there's an issue open), but in Brussels Alan Bateman and Mauricio agreed to add such an API. I think we need to push it a bit. You can help.

https://bugs.openjdk.org/browse/JDK-8292771

@uschindler
Copy link
Contributor

Once we have that we can add some code to lookup the MethodHandle for the madvise / fadvise from Panama and then call it with the file handle/descriptor returned by the native API. This is all optional, we could do it similar to vector API: Only get the MethodHandle if native access is enabled with command line flags, otherwise madvise/fadvise is a noop.

We could also think of adding mlock2() to stick parts of MemorySegments to RAM. This works out of box as you can pass a MemorySegment in MMapDir directly to native code as a pointer.

@rmuir
Copy link
Member

rmuir commented Mar 20, 2024

I think the Direct-IO solution is hacky and trappy myself. This is a really slow way to do perform IO, with the hope? that it will somehow make some separate NRT search process faster?

  1. not every app uses lucene NRT with indexing in this way and not every app wants lucene to take over entire machine. Example would be IDE that just uses lucene for searching: the user is doing plenty of other shit on their machine (e.g. compiling code and stuff).
  2. i dont understand the reflection on JDK classes and why we need that
  3. direct IO comes with a lot of complexity and downsides: really it isn't geared at what the code is doing here, it is instead geared at "special situations, such as when applications do their own caching". That is straight from the open(2) documentation.
  4. direct IO hack brings degraded performance, IMO its not the correct solution to the problem but just a hack. why move this into lucene core? a better solution would be something like https://lwn.net/Articles/806980/
  5. better solution with current lucene support, if you dont want merging to disrupt page cache for searches is to use NRT segment replication which allows searchers to be totally separate from indexing/merging.

@uschindler
Copy link
Contributor

uschindler commented Mar 20, 2024

@ChrisHegarty: For MMapDirectory there's another way for madvise to work (at least for reading index files): When the user has enabled native Panama access, we can also mmap our MemorySegments with our own code without using FileChannel, because we have to deal with file descriptor anyways and native access was already enabled. In fact we close the FileChannel after mapping anyways, so the FileChannel is completely useless for MMapDirectory code. We would need code for Posix and Windows.

For writing files it is more complicated, so setting fadvise there also involves replicating the whole (File-)OutputStream used for writing index files.

@ChrisHegarty
Copy link
Contributor

@ChrisHegarty: For MMapDirectory there's another way for madvise to work (at least for reading index files): When the user has enabled native Panama access, we can also mmap our MemorySegments with our own code without using FileChannel, because we have to deal with file descriptor anyways and native access was already enabled. In fact we close the FileChannel after mapping anyways, so the FileChannel is completely useless for MMapDirectory code. We would need code for Posix and Windows.

Yes, that is exactly what I've been thinking too. We're leaning more and more into FFI with Elasticsearch and it is proving to both behave and perform very well. Abstracting the functionality at a sightly higher-level may allow to have a Posix implementation and a fallback to an implementation based FileChannel. The latter would cover Windows - possibly ignoring the "hints" for random access. This could be a start, allowing to add more platform support as needed.

For writing files it is more complicated, so setting fadvise there also involves replicating the whole (File-)OutputStream used for writing index files.

Ugh!! not great. We need either the memory address or the file descriptor.

Not to mention that we map in 16GB chunks, so may require a little more plumbing (but nothing too hard). Is the mapping in chunks still needed? Did you @uschindler encounter issues when mapping larger chunks?

@uschindler
Copy link
Contributor

uschindler commented Mar 21, 2024

Not to mention that we map in 16GB chunks, so may require a little more plumbing (but nothing too hard). Is the mapping in chunks still needed? Did you @uschindler encounter issues when mapping larger chunks?

Yes, my tests on Windows showed havoc with large files when randomly mapping unmapping them. If you mmap a 1 Terabyte file and another one, then unmap one of them and mmap a slightly larger one -> running out of address space. On Linux its the same, but address space is larger (kernel has different mapping).

But this is not a problem form MemorySegmentIndexInputProvider, the mmap call is easy to do, the splicing is already there. We will just need two implementations for the filename -> MemorySegment[]. Easy, no changes needed.

@uschindler
Copy link
Contributor

I just noticed, for madvise no file descriptor is needed. We can implement that straight away using the memory segments.

So fadvise is the bigger problem needed for writing.

@ChrisHegarty
Copy link
Contributor

I just noticed, for madvise no file descriptor is needed. We can implement that straight away using the memory segments.

++ I think we have all we need enable this for reading - through a "hints" style API in IOContext.

I can take a look at it, but probably not until next week. I'm sure you already know this, but here's how we've been abstracting FFI in Elasticsearch, https://github.com/elastic/elasticsearch/tree/main/libs/native/src/main/java/org/elasticsearch/nativeaccess . While the use case is not the same here, the necessary abstractions and framework are not unalike.

@uschindler
Copy link
Contributor

uschindler commented Mar 21, 2024

I just noticed, for madvise no file descriptor is needed. We can implement that straight away using the memory segments.

++ I think we have all we need enable this for reading - through a "hints" style API in IOContext.

I can take a look at it, but probably not until next week. I'm sure you already know this, but here's how we've been abstracting FFI in Elasticsearch, https://github.com/elastic/elasticsearch/tree/main/libs/native/src/main/java/org/elasticsearch/nativeaccess . While the use case is not the same here, the necessary abstractions and framework are not unalike.

Thanks, this is great. For the use-case here, I think abstractions are not needed, so we can implement it in our MR-JAR:

  • we only need defaultLinker() so no System.loadLibrary and no special case for Java 22 needed (we don't use C strings).
  • looking up function addresses will be guarded with tryCatch and when we get a MethodHandle for madvise() or posix_madvise() we store it in the provider. If we can't find it or native access was denied by JVM (not allowed to module or by command line) we make it a noop MethodHandle or set it null
  • after when FileChannel returns a memorySegment, we just call madvise depending on IOContext with the MemorySegment as first parameter and the slice (0...length of segment):

I will hack a draft PR with the basic setup later (only in main branch with Java 21), exposing madvise() and mlock2() ase examples and then we can do tests on Linux (maybe also macos) by passing different options depending on IOContext.

P.S.: The IOContext is already available to the provider class (I did this in the original API design for exactly that use case; currently the parameter is unused):

public IndexInput openInput(Path path, IOContext context, int chunkSizePower, boolean preload)

@uschindler
Copy link
Contributor

I created a draft PR for IOContext#readOnce: #13196

@jpountz
Copy link
Contributor Author

jpountz commented Mar 26, 2024

Closing this issue as won't fix, I imagine that madvise/fadvise is considered a superior solution than direct I/O.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants