Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for posix_madvise to Java 21 MMapDirectory #13196

Merged
merged 27 commits into from
Mar 25, 2024

Conversation

uschindler
Copy link
Contributor

This is a first idea how we can use Panama Foreign to pass madvise() hints to the kernel when mapping memory segments.

The code looks up the function pointer from stdlib (libc) on Linux and Macos (untested, but should work) and then invokes madvise() for all MemorySegments we have mmapped when the following is true:

  • IOContext#readOnce is set
  • The chunk size is large enough (at least 8192 means chunkSizePower>=13) - this prevents TestMultiMMap from failing because for very small mappings (as done by this test), the FileChannel#map call will produce unaligned memory segments (it uses some tricks and maps larger segments and returns slices - which are no longer pageSize aligned)
  • There is a noop implementation doing nothing which is choosen if you disable native access

Interestingly it works without any extra parameters to command line (at least in Java 21).

This is a draft only to do some performance tests and extend the IOContext interpretation to try out more possibilities. The current "readOnce => MADV_SEQUENTIAL" is just an example as this is the main issue: We merge segments and don't want the soon to be trashed segments be sticky in RAM. MADV_SEQUENTIAL instructs kernel to forget about the mappings and also do readahead which helps during merging.

@jpountz
Copy link
Contributor

jpountz commented Mar 21, 2024

I remember this kind of things being discussed more than 10 years ago, it's extremely exciting to see it close to being included in the default Directory!

current "readOnce => MADV_SEQUENTIAL" is just an example as this is the main issue: We merge segments and don't want the soon to be trashed segments be sticky in

I believe that we don't use readOnce for merging, but for tiny metadata files that we load fully in memory. That said, I don't think that this should block this change, we can later look into the proper way to detect whether to pass MADV_SEQUENTIAL based on the IOContext.

@uschindler
Copy link
Contributor Author

I refactored the code a bit:

  • @jpountz I added the merge context (I checked DirectIODirectory).
  • I retrieve the pageSize to check if a segment is correctly aligned. Unfortunately this is mega-stupid. I tend to remove it again (the constant _SC_PAGESIZE is part of a C enum and its value is different in various platforms as no fixed value is assigned) and maybe keep a hardcoded page size to handle the case. Maybe just check that it is a multiple of 64K (native page size is 8K, but may be different), because it is normally only an issue in this test....

@uschindler
Copy link
Contributor Author

I removed the page size retrieval again. No need for it, we just be safe and only apply the advice, when the chunk size is large enough.

@uschindler
Copy link
Contributor Author

Unfortunately after the problems with other constants I tend to hide the advice constants in the class any use some abstraction, so we can call

We should maybe also simply remove the NoopNativeAccess and do a null check.

if (context.randomAccess) {
return OptionalInt.of(NativeAccess.POSIX_MADV_RANDOM);
}
if (context.readOnce || context.context == Context.MERGE) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should move the Context.MERGE check to be the first one, because when you merge segments you certainly always want to read the files with sequential advise (although the actual access may be random to a certain degree), because POSIX_MADV_SEQUENTIAL tells in its documentation: "Hence, pages in this region can be aggressively read ahead, and may be freed soon after they are accessed."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the check for MERGE context to beginning in 242e5e9. This makes it always win. The reason for that is: We often set MERGE context, but the other settings keep their defaults (like random access). So when merging is first, we always bail out and advise kernel: "may be freed soon after they are accessed"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when you merge segments you certainly always want to read the files with sequential advise

I've not yet checked, but is Context.MERGE set when merging HNSW graphs? (which would favour random access)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HNSW merging is fine because the first thing that merging does it to write all vectors to a temporary file. And then it builds the graph on top of this temporary file. So it's ok for the input segments to read sequentially, and we could update the merging logic to open this temporary file with IOContext.RANDOM since it's the one that will have a random access pattern.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect. That is exactly what I was looking for.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, so the current code looks fine.

One thing, I am doubting: If we delete a file, will the kernel free all those pages asap? If yes, why do we do the whole "sequentical" handling for merges at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may add another abstract method to NativeAcess that is called when we close a file. On Linux we could there give some instructions like "forget everything".

@uschindler
Copy link
Contributor Author

uschindler commented Mar 22, 2024

I refactored the code a bit:

  • The POSIX constants are now part of PosixNativeAcess class
  • The abstract method madvise() now only takes a MemorySegment and the IOContext. So it is now the hook for native implementations to implement the "advise" given by the IOContext for the platform (POSIX, Windows, maybe special case for Linux which has more constants if you use plain low-level non-posix madvise()). On Linux you can also tell e.g., on closing a file that you don't need the pages anymore.
  • I removed OptionalInt: The values are all small and are in the valueOf cache, so we just return null when no madvice is needed

One thing: Should we remove the NoopNativeAccess class and just add a null check?

… use are Poxix only. On Windows we can have a totally different way to madvise the kernel
jpountz and others added 2 commits March 22, 2024 11:49
* Set randomAccess=true on LOAD.

* Javadocs

* Cleanup code and move IOContext mapping to Posix, as the constants we use are Poxix only. On Windows we can have a totally different way to madvise the kernel

---------

Co-authored-by: Uwe Schindler <uschindler@apache.org>
Co-authored-by: Uwe Schindler <uwe@thetaphi.de>
@uschindler
Copy link
Contributor Author

Thanks for the PR contributions. You may also commit directly to this branch, unless the change is a complete rewrite :-)

@ChrisHegarty
Copy link
Contributor

ChrisHegarty commented Mar 22, 2024

I took another run at static final ;-). uschindler#5 (if we still don't want it, then I'll drop it)

[EDIT] it's in PR form, since I'm not sure that @uschindler wants it! ;-)

@uschindler
Copy link
Contributor Author

Maybe it is easier to see results on benchmarking when it is in main branch. I am waiting for final review by @jpountz and then merge this. Backporting to 9.x is also planned and should be done before #13205 is applied.

I will merge this tomorrow afternoon CET.

@uschindler
Copy link
Contributor Author

@jpountz Are you fine with merging?

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes indeed!

@uschindler uschindler merged commit a4055da into apache:main Mar 25, 2024
3 checks passed
@uschindler uschindler deleted the dev/posix_madvise branch March 25, 2024 17:44
@uschindler
Copy link
Contributor Author

The test added by @ChrisHegarty sometimes fails on windows: It does not close the file it opened for random access testing, so the directory can't be deleted. Will fix this in a separate commit.

asfgit pushed a commit that referenced this pull request Mar 25, 2024
asfgit pushed a commit that referenced this pull request Mar 25, 2024
@uschindler
Copy link
Contributor Author

I fixed the test in ae5d353

asfgit pushed a commit that referenced this pull request Mar 26, 2024
@uschindler
Copy link
Contributor Author

I also removed the extra logging included while development from the main branch. In 9.x the log message was adapted to list both features together with the sysprop to disable).

@ChrisHegarty
Copy link
Contributor

The test added by @ChrisHegarty sometimes fails on windows: It does not close the file it opened for random access testing, so the directory can't be deleted. Will fix this in a separate commit.

Oops, sorry about this. Thankfully the fix was straightforward.

@uschindler
Copy link
Contributor Author

uschindler commented Mar 26, 2024

The test added by @ChrisHegarty sometimes fails on windows: It does not close the file it opened for random access testing, so the directory can't be deleted. Will fix this in a separate commit.

Oops, sorry about this. Thankfully the fix was straightforward.

About this test and the quite large 8 MB buffer (see https://github.com/apache/lucene/pull/13196/files#diff-ebee319f3691cdb1627e5e9b1dfbdd0c266b1da28ccf0b2a9218dee1d34ff2b7R103):

Is this some limit inside the kernel to trigger something? For this test, we call posix_madvise() always.

Maybe there was a misunderstanding: We only do not call madvise, if the chunk size is too small (< 2 MiB), but by default the chunk size is 16 Gigabytes, so except for TestMultiMMap which uses smaller chnk sizes to test chunking logic, we always call madvise. Theres only one special case: If the file is zero length, then we can't call it.

@ChrisHegarty
Copy link
Contributor

I dunno what I was thinking, this is clearly not correct. I opened #13214 to fix the test. ( apologies for the stupid test issues! )

@jpountz
Copy link
Contributor

jpountz commented Mar 26, 2024

@uschindler Should we open a separate issue for adding fadvise support to NIOFSDirectory?

@uschindler
Copy link
Contributor Author

Unfortunately fadvise is at moment close to impossible. Reason: we have no file handle!

Chances are good that we also get a Java-based fadvise some time in the future (e.g., through an OpenOption like with O_DIRECT).

@uschindler
Copy link
Contributor Author

As disussed before, for implementing fadvise for reading/writing files, we would need to write a full stack of IO layer natively (OutputStream for writing and FileChannel for NIOFSDir). See https://bugs.openjdk.org/browse/JDK-8292771

@uschindler
Copy link
Contributor Author

Anyways we can open an issue to track what's going on on the JDK (listing all relevant issue numbers like the above one).

@uschindler
Copy link
Contributor Author

uschindler commented Mar 28, 2024

As disussed before, for implementing fadvise for reading/writing files, we would need to write a full stack of IO layer natively (OutputStream for writing and FileChannel for NIOFSDir). See https://bugs.openjdk.org/browse/JDK-8292771

Here is the posix_fadvise specific issue: https://bugs.openjdk.org/browse/JDK-8329256

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants