Add support for posix_madvise to Java 21 MMapDirectory #13196
I remember this kind of thing being discussed more than 10 years ago; it's extremely exciting to see it close to being included in the default
I believe that we don't use |
I refactored the code a bit:
I removed the page size retrieval again. There is no need for it: we just stay safe and only apply the advice when the chunk size is large enough.
Unfortunately, after the problems with other constants, I tend to hide the advice constants in the class and use some abstraction, so we can call We should maybe also simply remove the NoopNativeAccess and do a null check.
lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInputProvider.java
lucene/core/src/java21/org/apache/lucene/store/PosixNativeAccess.java
if (context.randomAccess) {
  return OptionalInt.of(NativeAccess.POSIX_MADV_RANDOM);
}
if (context.readOnce || context.context == Context.MERGE) {
I think we should move the Context.MERGE check to be the first one, because when you merge segments you certainly always want to read the files with sequential advice (although the actual access may be random to a certain degree). The documentation of POSIX_MADV_SEQUENTIAL says: "Hence, pages in this region can be aggressively read ahead, and may be freed soon after they are accessed."
I moved the check for the MERGE context to the beginning in 242e5e9. This makes it always win. The reason is that we often set the MERGE context while the other settings keep their defaults (like random access). So when the merge check comes first, we always bail out and advise the kernel: "may be freed soon after they are accessed"
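The ordering described above can be sketched as follows. This is a minimal illustration, not the code from the PR: `Context` and `IOContextSketch` are hypothetical stand-ins for Lucene's `IOContext`, and the numeric advice values are the ones glibc uses on Linux, which is an assumption here.

```java
import java.util.OptionalInt;

final class AdviceMappingSketch {
  // Assumption: advice values as defined by glibc on Linux.
  static final int POSIX_MADV_RANDOM = 1;
  static final int POSIX_MADV_SEQUENTIAL = 2;

  /** Hypothetical stand-in for Lucene's IOContext.Context. */
  enum Context { DEFAULT, MERGE, FLUSH }

  /** Hypothetical stand-in for IOContext, reduced to the fields discussed in this thread. */
  record IOContextSketch(Context context, boolean randomAccess, boolean readOnce) {}

  /** Maps a context to an advice value, or empty if no advice should be given. */
  static OptionalInt mapIOContext(IOContextSketch ctx) {
    // MERGE is checked first so that it wins even if randomAccess kept its default.
    if (ctx.context() == Context.MERGE) {
      return OptionalInt.of(POSIX_MADV_SEQUENTIAL);
    }
    if (ctx.randomAccess()) {
      return OptionalInt.of(POSIX_MADV_RANDOM);
    }
    if (ctx.readOnce()) {
      return OptionalInt.of(POSIX_MADV_SEQUENTIAL);
    }
    return OptionalInt.empty();
  }

  public static void main(String[] args) {
    // A merge context with randomAccess=true still maps to sequential advice.
    System.out.println(mapIOContext(new IOContextSketch(Context.MERGE, true, false)));
    System.out.println(mapIOContext(new IOContextSketch(Context.DEFAULT, true, false)));
  }
}
```

With the MERGE check first, a merge that leaves `randomAccess` at its default still gets sequential advice, which is the behaviour argued for in this review thread.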
"when you merge segments you certainly always want to read the files with sequential advise"
I've not yet checked, but is Context.MERGE set when merging HNSW graphs? (which would favour random access)
HNSW merging is fine, because the first thing that merging does is to write all vectors to a temporary file. It then builds the graph on top of this temporary file. So it's OK for the input segments to be read sequentially, and we could update the merging logic to open this temporary file with IOContext.RANDOM, since it's the one that will have a random access pattern.
Perfect. That is exactly what I was looking for.
Thanks, so the current code looks fine.
One thing I am wondering about: if we delete a file, will the kernel free all those pages ASAP? If yes, why do we do the whole "sequential" handling for merges at all?
We may add another abstract method to NativeAccess that is called when we close a file. On Linux we could give some instructions there like "forget everything".
This reverts commit d311b97.
I refactored the code a bit:
One thing: Should we remove the NoopNativeAccess class and just add a null check? |
… use are Poxix only. On Windows we can have a totally different way to madvise the kernel
62c708e
to
a729f02
Compare
* Set randomAccess=true on LOAD.
* Javadocs
* Cleanup code and move IOContext mapping to Posix, as the constants we use are Poxix only. On Windows we can have a totally different way to madvise the kernel
---------
Co-authored-by: Uwe Schindler <uschindler@apache.org>
Co-authored-by: Uwe Schindler <uwe@thetaphi.de>
Thanks for the PR contributions. You may also commit directly to this branch, unless the change is a complete rewrite :-) |
I took another run at static final ;-). uschindler#5 (if we still don't want it, then I'll drop it) [EDIT] it's in PR form, since I'm not sure that @uschindler wants it! ;-) |
@jpountz Are you fine with merging? |
Yes indeed!
…o dev/posix_madvise
The test added by @ChrisHegarty sometimes fails on Windows: it does not close the file it opened for random access testing, so the directory can't be deleted. I will fix this in a separate commit.
I fixed the test in ae5d353 |
…neral logging on MMapDirectory startup
I also removed the extra logging that was included during development from the main branch. (In 9.x the log message was adapted to list both features together with the sysprop to disable them.)
Oops, sorry about this. Thankfully the fix was straightforward. |
About this test and the quite large 8 MB buffer (see https://github.com/apache/lucene/pull/13196/files#diff-ebee319f3691cdb1627e5e9b1dfbdd0c266b1da28ccf0b2a9218dee1d34ff2b7R103): Is this some limit inside the kernel to trigger something? For this test, we always call posix_madvise(). Maybe there was a misunderstanding: we skip madvise only if the chunk size is too small (< 2 MiB), but by default the chunk size is 16 GiB, so except for TestMultiMMap, which uses smaller chunk sizes to test the chunking logic, we always call madvise. There's only one special case: if the file is zero length, we can't call it.
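The rule in this comment can be written down as a small guard. This is only a sketch of the rule as stated here, not the actual MMapDirectory code: the class and method names are hypothetical, and the 2 MiB threshold (a chunk size power of 21) is taken from the comment rather than from the source.

```java
final class MadviseGuardSketch {
  // Assumption: 2 MiB == 1 << 21, the threshold quoted in the comment.
  static final int MIN_CHUNK_SIZE_POWER = 21;

  /** Returns true if advice should be passed to the kernel for a mapped segment. */
  static boolean shouldAdvise(long segmentLength, int chunkSizePower) {
    if (segmentLength == 0L) {
      return false; // zero-length files cannot be advised
    }
    // Small chunks (as used by TestMultiMMap) are skipped to stay safe.
    return chunkSizePower >= MIN_CHUNK_SIZE_POWER;
  }

  public static void main(String[] args) {
    // Default chunk size of 16 GiB corresponds to a chunk size power of 34.
    System.out.println(shouldAdvise(8L * 1024 * 1024, 34));
  }
}
```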
I dunno what I was thinking, this is clearly not correct. I opened #13214 to fix the test. ( apologies for the stupid test issues! ) |
@uschindler Should we open a separate issue for adding |
Unfortunately, fadvise is close to impossible at the moment. Reason: we have no file handle! Chances are good that we will also get a Java-based fadvise some time in the future (e.g., through an OpenOption, as with O_DIRECT).
As discussed before, to implement fadvise for reading/writing files we would need to write a full native IO stack (OutputStream for writing and FileChannel for NIOFSDir). See https://bugs.openjdk.org/browse/JDK-8292771
Anyway, we can open an issue to track what's going on in the JDK (listing all relevant issue numbers like the one above).
Here is the |
This is a first idea how we can use Panama Foreign to pass madvise() hints to the kernel when mapping memory segments.
The code looks up the function pointer from stdlib (libc) on Linux and macOS (untested, but should work) and then invokes madvise() for all MemorySegments we have mmapped when the following is true: chunkSizePower>=13. This prevents TestMultiMMap from failing, because for very small mappings (as done by this test) the FileChannel#map call will produce unaligned memory segments (it uses some tricks and maps larger segments and returns slices, which are no longer pageSize aligned).
Interestingly, it works without any extra command-line parameters (at least in Java 21).
This is only a draft, meant for performance tests and for extending the IOContext interpretation to try out more possibilities. The current "readOnce => MADV_SEQUENTIAL" is just an example, as this is the main issue: we merge segments and don't want the soon-to-be-trashed segments to stay sticky in RAM. MADV_SEQUENTIAL instructs the kernel to forget about the mappings and also to do readahead, which helps during merging.
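The lookup-and-invoke approach described above can be sketched roughly as below. This is not the PR's implementation: the class and method names are hypothetical, the sketch calls posix_madvise (as in the PR title) with the advice value used on Linux, and it assumes Java 22 or later, where java.lang.foreign is final (on Java 21 it is a preview API). It returns -1 when the symbol cannot be found, for example on Windows.

```java
import java.io.IOException;
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Optional;

final class MadviseSketch {
  // Assumption: value of POSIX_MADV_SEQUENTIAL on Linux.
  static final int POSIX_MADV_SEQUENTIAL = 2;

  /** Looks up int posix_madvise(void*, size_t, int) in the default lookup (libc). */
  static Optional<MethodHandle> lookupPosixMadvise() {
    Linker linker = Linker.nativeLinker();
    FunctionDescriptor descriptor = FunctionDescriptor.of(
        ValueLayout.JAVA_INT, ValueLayout.ADDRESS, ValueLayout.JAVA_LONG, ValueLayout.JAVA_INT);
    return linker.defaultLookup().find("posix_madvise")
        .map(address -> linker.downcallHandle(address, descriptor));
  }

  /** Maps a temporary file and advises the kernel; returns the native result, or -1 if unsupported. */
  static int adviseTempFile() {
    Optional<MethodHandle> madvise = lookupPosixMadvise();
    if (madvise.isEmpty()) {
      return -1; // symbol not available on this platform
    }
    Path file = null;
    try {
      file = Files.createTempFile("madvise-sketch", ".bin");
      Files.write(file, new byte[1 << 16]);
      try (Arena arena = Arena.ofConfined();
          FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
        // Mapping from offset 0 yields a page-aligned segment.
        MemorySegment segment =
            channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size(), arena);
        return (int) madvise.get().invokeExact(segment, segment.byteSize(), POSIX_MADV_SEQUENTIAL);
      }
    } catch (Throwable t) {
      throw new RuntimeException(t);
    } finally {
      if (file != null) {
        try {
          Files.deleteIfExists(file);
        } catch (IOException ignored) {
          // best-effort cleanup of the temporary file
        }
      }
    }
  }

  public static void main(String[] args) {
    System.out.println("posix_madvise result: " + adviseTempFile());
  }
}
```

posix_madvise returns 0 on success and an error number otherwise, so a caller would typically ignore or log a non-zero result, since the advice is only a hint.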