Add support for posix_madvise to Java 21 MMapDirectory #13196

uschindler · 2024-03-21T14:57:39Z

This is a first idea how we can use Panama Foreign to pass madvise() hints to the kernel when mapping memory segments.

The code looks up the function pointer from stdlib (libc) on Linux and macOS (untested, but should work) and then invokes madvise() for all MemorySegments we have mmapped when the following is true:

IOContext#readOnce is set
The chunk size is large enough (at least 8192, i.e. chunkSizePower >= 13). This prevents TestMultiMMap from failing: for very small mappings (as used by this test), the FileChannel#map call produces unaligned memory segments (it uses some tricks, maps larger segments and returns slices, which are no longer page-size aligned)
There is a no-op implementation, which is chosen if you disable native access

Interestingly, it works without any extra command-line parameters (at least in Java 21).

This is only a draft, meant for running some performance tests and for extending the IOContext interpretation to try out more possibilities. The current "readOnce => MADV_SEQUENTIAL" is just an example as this is the main issue: We merge segments and don't want the soon to be trashed segments to be sticky in RAM. MADV_SEQUENTIAL tells the kernel that it may drop the pages soon after they are accessed and that it can read ahead, which helps during merging.

jpountz · 2024-03-21T16:46:49Z

I remember this kind of thing being discussed more than 10 years ago; it's extremely exciting to see it close to being included in the default Directory!

current "readOnce => MADV_SEQUENTIAL" is just an example as this is the main issue: We merge segments and don't want the soon to be trashed segments be sticky in

I believe that we don't use readOnce for merging, but for tiny metadata files that we load fully in memory. That said, I don't think that this should block this change, we can later look into the proper way to detect whether to pass MADV_SEQUENTIAL based on the IOContext.

lucene/core/src/java21/org/apache/lucene/store/NativeAccess.java

lucene/core/src/java21/org/apache/lucene/store/PosixNativeAccess.java

uschindler · 2024-03-21T19:01:19Z

I refactored the code a bit:

@jpountz I added the merge context (I checked DirectIODirectory).
I retrieve the pageSize to check whether a segment is correctly aligned. Unfortunately this is mega-stupid: the constant _SC_PAGESIZE is part of a C enum with no fixed value assigned, so its value differs between platforms. I am inclined to remove it again and maybe keep a hardcoded page size to handle the case. Maybe just check that it is a multiple of 64K (native page size is 8K, but may be different), because it is normally only an issue in this test....
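For illustration, an alignment check of this kind could look roughly like the sketch below. The class and method names are hypothetical and not taken from the PR, and the page size is simply passed in by the caller (a hardcoded 64K would be one option), so nothing is queried from the OS.

```java
// Hypothetical helper, not part of the PR: checks whether a mapped address
// starts on a page boundary for a caller-supplied page size.
final class PageAlignment {
  private PageAlignment() {}

  /** Returns true if {@code address} is a multiple of {@code pageSize} (a power of two). */
  static boolean isAligned(long address, long pageSize) {
    // A page size is always a positive power of two, i.e. exactly one bit is set.
    if (pageSize <= 0 || Long.bitCount(pageSize) != 1) {
      throw new IllegalArgumentException("pageSize must be a positive power of two: " + pageSize);
    }
    // For a power of two, "address % pageSize == 0" is the same as masking the low bits.
    return (address & (pageSize - 1)) == 0;
  }
}
```

A slice that starts 512 bytes into a 64K-aligned mapping would fail this check, which is the situation described for the very small mappings in TestMultiMMap.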

uschindler · 2024-03-21T19:28:46Z

I removed the page size retrieval again. There is no need for it; we just play it safe and only apply the advice when the chunk size is large enough.

uschindler · 2024-03-21T19:39:16Z

Unfortunately, after the problems with other constants, I tend to hide the advice constants in the class and use some abstraction, so we can call

We should maybe also simply remove the NoopNativeAccess and do a null check.

lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInputProvider.java

lucene/core/src/java21/org/apache/lucene/store/PosixNativeAccess.java

lucene/core/src/java/org/apache/lucene/store/IOContext.java

uschindler · 2024-03-22T08:53:10Z

lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInputProvider.java

+    if (context.randomAccess) {
+      return OptionalInt.of(NativeAccess.POSIX_MADV_RANDOM);
+    }
+    if (context.readOnce || context.context == Context.MERGE) {


I think we should move the Context.MERGE check to be the first one, because when you merge segments you certainly always want to read the files with sequential advice (although the actual access may be random to a certain degree), because the documentation of POSIX_MADV_SEQUENTIAL says: "Hence, pages in this region can be aggressively read ahead, and may be freed soon after they are accessed."

I moved the check for the MERGE context to the beginning in 242e5e9. This makes it always win. The reason for that is: We often set the MERGE context, but the other settings keep their defaults (like random access). So when merging is checked first, we always bail out and advise the kernel: "may be freed soon after they are accessed"
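As a rough sketch of the ordering discussed here (not the actual code from 242e5e9): the class, the enum and the three-argument method below are hypothetical simplifications of IOContext, and the constant values are the ones glibc uses on Linux, which is an assumption for other platforms.

```java
import java.util.OptionalInt;

// Hypothetical, simplified stand-in for the IOContext-to-advice mapping.
final class AdviceMapping {
  // Values as defined by glibc on Linux (assumption: other platforms may differ).
  static final int POSIX_MADV_RANDOM = 1;
  static final int POSIX_MADV_SEQUENTIAL = 2;

  // Simplified context enum for illustration only.
  enum Context { MERGE, READ, FLUSH, DEFAULT }

  private AdviceMapping() {}

  static OptionalInt adviceFor(Context context, boolean readOnce, boolean randomAccess) {
    // MERGE is checked first so that it always wins, even if randomAccess is also set.
    if (context == Context.MERGE) {
      return OptionalInt.of(POSIX_MADV_SEQUENTIAL);
    }
    if (randomAccess) {
      return OptionalInt.of(POSIX_MADV_RANDOM);
    }
    if (readOnce) {
      return OptionalInt.of(POSIX_MADV_SEQUENTIAL);
    }
    // No hint: leave the kernel's default behavior untouched.
    return OptionalInt.empty();
  }
}
```

With this ordering, a merge context that still carries a random-access flag maps to the sequential advice rather than the random one.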

when you merge segments you certainly always want to read the files with sequential advice

I've not yet checked, but is Context.MERGE set when merging HNSW graphs? (which would favour random access)

HNSW merging is fine because the first thing that merging does is to write all vectors to a temporary file. And then it builds the graph on top of this temporary file. So it's ok for the input segments to be read sequentially, and we could update the merging logic to open this temporary file with IOContext.RANDOM since it's the one that will have a random access pattern.

Perfect. That is exactly what I was looking for.

Thanks, so the current code looks fine.

One thing I am wondering about: If we delete a file, will the kernel free all those pages asap? If yes, why do we do the whole "sequential" handling for merges at all?

We may add another abstract method to NativeAccess that is called when we close a file. On Linux we could then give some instructions like "forget everything".

uschindler · 2024-03-22T10:28:44Z

I refactored the code a bit:

The POSIX constants are now part of the PosixNativeAccess class
The abstract method madvise() now only takes a MemorySegment and the IOContext. So it is now the hook for native implementations to apply the advice given by the IOContext for the platform (POSIX, Windows, maybe a special case for Linux, which has more constants if you use the plain low-level non-POSIX madvise()). On Linux you can also tell the kernel, e.g. on closing a file, that you don't need the pages anymore.
I removed OptionalInt: The values are all small and are in the Integer.valueOf cache, so we just return null when no madvise is needed
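To illustrate the last point with a minimal sketch: Integer.valueOf is documented to cache values from -128 to 127, so boxing a small advice constant returns the cached instance, and null can stand for "no advice". The class and method names are hypothetical, and the constant value is the glibc one (an assumption for other platforms).

```java
// Hypothetical illustration of returning a boxed advice value or null.
final class AdviceBoxing {
  // glibc value on Linux (assumption: may differ elsewhere).
  static final int POSIX_MADV_SEQUENTIAL = 2;

  private AdviceBoxing() {}

  /** Returns the boxed advice, or null when no hint should be passed to the kernel. */
  static Integer adviceOrNull(boolean sequential) {
    // Integer.valueOf always returns the cached instance for values in -128..127.
    return sequential ? Integer.valueOf(POSIX_MADV_SEQUENTIAL) : null;
  }
}
```

A caller would then do a plain null check before invoking the native function.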

One thing: Should we remove the NoopNativeAccess class and just add a null check?

* Set randomAccess=true on LOAD. * Javadocs * Cleanup code and move IOContext mapping to Posix, as the constants we use are Poxix only. On Windows we can have a totally different way to madvise the kernel --------- Co-authored-by: Uwe Schindler <uschindler@apache.org> Co-authored-by: Uwe Schindler <uwe@thetaphi.de>

uschindler · 2024-03-22T10:51:38Z

Thanks for the PR contributions. You may also commit directly to this branch, unless the change is a complete rewrite :-)

ChrisHegarty · 2024-03-22T10:53:37Z

I took another run at static final ;-). uschindler#5 (if we still don't want it, then I'll drop it)

[EDIT] it's in PR form, since I'm not sure that @uschindler wants it! ;-)

uschindler · 2024-03-24T12:43:00Z

Maybe it is easier to see benchmark results when this is in the main branch. I am waiting for the final review by @jpountz and will then merge this. Backporting to 9.x is also planned and should be done before #13205 is applied.

I will merge this tomorrow afternoon CET.

uschindler · 2024-03-25T15:50:37Z

@jpountz Are you fine with merging?

jpountz

Yes indeed!

uschindler · 2024-03-25T19:26:21Z

The test added by @ChrisHegarty sometimes fails on windows: It does not close the file it opened for random access testing, so the directory can't be deleted. Will fix this in a separate commit.

uschindler · 2024-03-25T19:40:03Z

I fixed the test in ae5d353
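The general pattern behind this kind of fix can be sketched as follows. This is not the code from ae5d353; the class, file names and sizes are made up for illustration. The point is only that the handle is closed (here via try-with-resources) before the file and its directory are deleted, which Windows requires.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical example: close the file before deleting its directory.
final class CloseBeforeDelete {
  private CloseBeforeDelete() {}

  static boolean readThenCleanUp() {
    try {
      Path dir = Files.createTempDirectory("close-before-delete");
      Path file = dir.resolve("data.bin");
      Files.write(file, new byte[] {1, 2, 3});

      long size;
      // try-with-resources guarantees the channel is closed before cleanup starts.
      try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
        size = channel.size();
      }

      // With an open handle these deletes could fail on Windows.
      Files.delete(file);
      Files.delete(dir);
      return size == 3 && Files.notExists(dir);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```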

uschindler · 2024-03-26T09:24:55Z

I also removed the extra logging, which was added during development, from the main branch. In 9.x the log message was adapted to list both features together (with the sysprop to disable them).

ChrisHegarty · 2024-03-26T09:27:23Z

The test added by @ChrisHegarty sometimes fails on windows: It does not close the file it opened for random access testing, so the directory can't be deleted. Will fix this in a separate commit.

Oops, sorry about this. Thankfully the fix was straightforward.

uschindler · 2024-03-26T09:50:49Z

The test added by @ChrisHegarty sometimes fails on windows: It does not close the file it opened for random access testing, so the directory can't be deleted. Will fix this in a separate commit.

Oops, sorry about this. Thankfully the fix was straightforward.

About this test and the quite large 8 MB buffer (see https://github.com/apache/lucene/pull/13196/files#diff-ebee319f3691cdb1627e5e9b1dfbdd0c266b1da28ccf0b2a9218dee1d34ff2b7R103):

Is this some limit inside the kernel to trigger something? For this test, we always call posix_madvise().

Maybe there was a misunderstanding: We skip madvise only if the chunk size is too small (< 2 MiB). By default the chunk size is 16 Gigabytes, so except for TestMultiMMap, which uses smaller chunk sizes to test the chunking logic, we always call madvise. There's only one special case: If the file is zero length, then we can't call it.
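Put as a small sketch, the rule described above is roughly the following. The class and method names are hypothetical and the real check lives elsewhere in the PR; only the 2 MiB threshold and the zero-length special case come from the discussion.

```java
// Hypothetical helper restating when madvise is applied.
final class MadviseGate {
  // Safe lower bound for the chunk size, as mentioned in the discussion.
  static final long MIN_CHUNK_SIZE = 2L * 1024 * 1024; // 2 MiB

  private MadviseGate() {}

  static boolean shouldAdvise(int chunkSizePower, long fileLength) {
    if (chunkSizePower < 0 || chunkSizePower > 62) {
      throw new IllegalArgumentException("chunkSizePower out of range: " + chunkSizePower);
    }
    // The chunk size is always a power of two: 2^chunkSizePower bytes.
    long chunkSize = 1L << chunkSizePower;
    // Skip tiny chunks (possibly unaligned slices) and empty files (nothing is mapped).
    return chunkSize >= MIN_CHUNK_SIZE && fileLength > 0;
  }
}
```

For example, a chunkSizePower of 34 (16 GiB) passes the check for any non-empty file, while a chunkSizePower of 13 (8192 bytes) does not.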

ChrisHegarty · 2024-03-26T10:11:47Z

I dunno what I was thinking, this is clearly not correct. I opened #13214 to fix the test. ( apologies for the stupid test issues! )

jpountz · 2024-03-26T10:15:01Z

@uschindler Should we open a separate issue for adding fadvise support to NIOFSDirectory?

uschindler · 2024-03-26T10:17:43Z

Unfortunately fadvise is close to impossible at the moment. Reason: we have no file handle!

Chances are good that we also get a Java-based fadvise some time in the future (e.g., through an OpenOption like with O_DIRECT).

uschindler · 2024-03-26T10:19:31Z

As discussed before, to implement fadvise for reading/writing files, we would need to write a full stack of IO layers natively (OutputStream for writing and FileChannel for NIOFSDir). See https://bugs.openjdk.org/browse/JDK-8292771

uschindler · 2024-03-26T10:24:51Z

Anyway, we can open an issue to track what's going on in the JDK (listing all relevant issue numbers like the one above).

uschindler · 2024-03-28T12:38:11Z

As discussed before, to implement fadvise for reading/writing files, we would need to write a full stack of IO layers natively (OutputStream for writing and FileChannel for NIOFSDir). See https://bugs.openjdk.org/browse/JDK-8292771

Here is the posix_fadvise specific issue: https://bugs.openjdk.org/browse/JDK-8329256

Add support for posix_madvise to Java 21 MMapDirectory

670b06c

uschindler added the type:enhancement label Mar 21, 2024

uschindler requested review from rmuir and ChrisHegarty March 21, 2024 14:57

uschindler self-assigned this Mar 21, 2024

uschindler marked this pull request as draft March 21, 2024 14:57

uschindler mentioned this pull request Mar 21, 2024

Should we fold DirectIODirectory into FSDirectory? #13194

Closed

ChrisHegarty reviewed Mar 21, 2024

lucene/core/src/java21/org/apache/lucene/store/NativeAccess.java

lucene/core/src/java21/org/apache/lucene/store/PosixNativeAccess.java

uschindler added 2 commits March 21, 2024 19:43

Cleanup and retrieve page size on linux; TODO: the constant seems sys…

1ca24fe

…tem dependent

fix bug in warning

7cec218

Remove page size retrieval again and use a safe boundary (2 MiB)

9a6faca

fix logger name

4d9a442

ChrisHegarty reviewed Mar 21, 2024

lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInputProvider.java

lucene/core/src/java21/org/apache/lucene/store/PosixNativeAccess.java

ChrisHegarty and others added 2 commits March 21, 2024 20:28

static final MH

d311b97

Add IOContext.randomAccess (#3)

5873d7c

jpountz reviewed Mar 22, 2024

lucene/core/src/java/org/apache/lucene/store/IOContext.java

uschindler commented Mar 22, 2024

uschindler added 2 commits March 22, 2024 10:16

Revert "static final MH"

9ed1da7

This reverts commit d311b97.

Make merge context always win

242e5e9

Cleanup code and move IOContext mapping to Posix, as the constants we…

a729f02

… use are Poxix only. On Windows we can have a totally different way to madvise the kernel

uschindler force-pushed the dev/posix_madvise branch from 62c708e to a729f02 Compare March 22, 2024 10:36

jpountz and others added 2 commits March 22, 2024 11:49

Another run at static final

c9e5ae4

uschindler added this to the 9.11.0 milestone Mar 24, 2024

uschindler mentioned this pull request Mar 24, 2024

Convert IOContext, MergeInfo, and FlushInfo to record classes #13205

Merged

uschindler requested a review from jpountz March 24, 2024 12:41

jpountz approved these changes Mar 25, 2024

uschindler added 2 commits March 25, 2024 18:37

Merge branch 'main' of https://gitbox.apache.org/repos/asf/lucene int…

94073e1

…o dev/posix_madvise

add CHANGES.txt

3919323

uschindler merged commit a4055da into apache:main Mar 25, 2024
3 checks passed

uschindler deleted the dev/posix_madvise branch March 25, 2024 17:44

uschindler mentioned this pull request Mar 25, 2024

Add support for posix_madvise to Java 21 MMapDirectory (backport) #13213

Merged

asfgit pushed a commit that referenced this pull request Mar 25, 2024

#13196, #13213: Fix test on windows to close file so it can be delete…

61e5da3

…d after test run

asfgit pushed a commit that referenced this pull request Mar 25, 2024

#13196, #13213: Fix test on windows to close file so it can be delete…

ae5d353

…d after test run

asfgit pushed a commit that referenced this pull request Mar 26, 2024

#13196, #13213: Remove logging relic from main branch

9fffc6e

asfgit pushed a commit that referenced this pull request Mar 26, 2024

#13196, #13213: Remove logging relic from 9.x branch and add it to ge…

a2ca63f

…neral logging on MMapDirectory startup

ChrisHegarty mentioned this pull request Mar 26, 2024

Avoid creating large buffer in TestMMapDirectory.testWithRandom #13214

Merged

This was referenced Mar 26, 2024

Avoid file cache trashing on Linux with mmapfs by using madvise() ? elastic/elasticsearch#27748

Closed

[BUG] Evaluate the performance of hybridfs against mmapfs opensearch-project/OpenSearch#8298

Open

geekpete mentioned this pull request Jun 10, 2024

Document that Transparent Huge Pages should be disabled on Linux elastic/elasticsearch#26551

Open
