Recommend lowering the default mmap readahead. #13223

jpountz · 2024-03-27T09:59:36Z

This is a follow-up to a discussion on #13219. mmap has a higher readahead than regular read() operations by default, e.g. 128kB instead of 16kB on my Linux box. On indexes that exceed the size of the page cache, this may trigger performance issues due to page cache thrashing and additional page cache contention. Rather than forcing MMapDirectory to use MADV_RANDOM on all files, it would make more sense to configure a lower mmap readahead at the system level, e.g. the same readahead value that read() operations use.

uschindler

If you fix the HTML violations, all fine.

rmuir · 2024-03-28T11:11:05Z

lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java

@@ -38,6 +38,15 @@
 * fragmented address space. If you get an {@link IOException} about mapping failed, it is
 * recommended to reduce the chunk size, until it works.
 *
+ * <p><b>NOTE</b>: On some platforms like Linux, mmap comes with a higher readahead than regular
+ * read() operations, e.g. 128kB for mmap reads and 16kB for regular reads. Such a high default


Can you point to the place in the kernel where this is happening?

rmuir

I don't think we should ask users to modify these settings on their block devices. At the least, I'd like to see actual documentation on why this should be adjusted for Lucene (it will also impact the entire system).
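For context on which setting is being debated, here is a minimal sketch that lists the current readahead per block device. It assumes the value in question is the per-device `read_ahead_kb` file that Linux exposes under `/sys/block/<device>/queue/`; the class and method names are hypothetical and are not part of Lucene. On systems without sysfs it simply prints nothing.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

public class ReadaheadInspector {

  /** Parses the content of a read_ahead_kb file, e.g. "128\n", into a kB value. */
  static int parseReadAheadKb(String content) {
    return Integer.parseInt(content.trim());
  }

  /** Returns device name -> readahead in kB; empty if the directory does not exist. */
  static Map<String, Integer> readAheadByDevice(Path sysBlock) throws IOException {
    Map<String, Integer> result = new TreeMap<>();
    if (!Files.isDirectory(sysBlock)) {
      return result; // not Linux, or sysfs is not mounted
    }
    try (DirectoryStream<Path> devices = Files.newDirectoryStream(sysBlock)) {
      for (Path device : devices) {
        Path setting = device.resolve("queue").resolve("read_ahead_kb");
        if (Files.isReadable(setting)) {
          result.put(
              device.getFileName().toString(), parseReadAheadKb(Files.readString(setting)));
        }
      }
    }
    return result;
  }

  public static void main(String[] args) throws IOException {
    // Assumption: sysfs is mounted at its usual location.
    readAheadByDevice(Paths.get("/sys/block"))
        .forEach((device, kb) -> System.out.println(device + ": " + kb + " kB"));
  }
}
```

Because the value is per block device, changing it affects every process that reads from that device, which is the system-wide impact raised above.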

uschindler · 2024-03-28T11:25:30Z

I am also a bit skeptical about why you need to modify the block device. If this were a file system setting, I could imagine it being useful.

@rmuir this came from investigation by Wikimedia on Elasticsearch elastic/elasticsearch#27748

rmuir · 2024-03-28T15:41:31Z

My thoughts here are that these issues can be addressed by providing correct advice to madvise. IMO this should typically be MADV_RANDOM because accesses are in random order: even if "we" think of it as sequential, we are sharing a single filemap across multiple threads, and each of them seeks to some random place in the file, reads a little, and is done. That's RANDOM!

We should benchmark this stuff to get it right.

jpountz · 2024-03-28T17:38:07Z

For reference, this change is based on observations similar to those made at https://biriukov.dev/docs/page-cache/3-page-cache-and-basic-file-operations. mmap comes with a 128kB readahead while read() only does a 16kB readahead. I can reproduce the exact same numbers with the code below, after dropping caches and checking the result with vmtouch.

  public static void main(String[] args) throws Exception {
    // FSDirectory.open typically picks MMapDirectory on 64-bit JVMs;
    // switch to NIOFSDirectory to test with read()
    try (FSDirectory dir = FSDirectory.open(Paths.get("/data/a"));
        IndexInput in = dir.openInput("term-ids__47.tmp", IOContext.READ)) {
      in.readInt(); // read 4 bytes, then check how much of the file is resident
    }
  }

While 16kB has proved workable in practice, we've seen major performance issues with Elasticsearch when a 128kB readahead is combined with indexes that exceed the size of the page cache. My first take was that 128kB feels huge for a default readahead, almost buggy, and it's not clear to me why it's so much higher than with read(). Since this is controversial, I'm ok with the alternative approach of using MADV_RANDOM all the time for IOContext.READ. We should benchmark the impact of a smaller readahead to confirm it performs well. From my testing, it only reads one page at a time in that case, but intuitively it should be ok.
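To put the numbers in this comment side by side, here is a back-of-the-envelope sketch of how much data each readahead size pulls into the page cache for a single 4-byte read such as `in.readInt()`. It assumes 4 KiB pages and a cold cache where every access triggers a full readahead window, which is a simplification since the kernel can adjust the window. The class and method names are hypothetical.

```java
public class ReadaheadAmplification {

  /** Number of whole pages loaded for one access with the given readahead window. */
  static long pagesPerRead(long readaheadBytes, long pageSizeBytes) {
    // Even with no readahead, the page containing the requested bytes is loaded.
    return Math.max(1, (readaheadBytes + pageSizeBytes - 1) / pageSizeBytes);
  }

  /** Bytes brought into the page cache per byte the application asked for. */
  static long amplification(long readaheadBytes, long pageSizeBytes, long requestedBytes) {
    return pagesPerRead(readaheadBytes, pageSizeBytes) * pageSizeBytes / requestedBytes;
  }

  public static void main(String[] args) {
    long pageSize = 4096; // assumption: 4 KiB pages
    long requested = Integer.BYTES; // a single readInt() call
    // 128kB (mmap), 16kB (read()), and 0 to model one page at a time
    long[] readaheads = {128 * 1024, 16 * 1024, 0};
    for (long readahead : readaheads) {
      System.out.println(
          (readahead / 1024)
              + " kB readahead: "
              + pagesPerRead(readahead, pageSize)
              + " pages, amplification "
              + amplification(readahead, pageSize, requested)
              + "x");
    }
  }
}
```

Under these assumptions a 128kB window loads 32 pages per random access, a 16kB window loads 4, and the one-page-at-a-time case loads 1, which is why the larger window puts more pressure on a page cache that is smaller than the index.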

jpountz · 2024-03-29T09:32:37Z

> the alternative approach of using a MADV_RANDOM all the time for IOContext.READ

I opened #13244 to show what this could look like.

mikemccand · 2024-04-01T16:26:14Z

I was trying to understand exactly how modern Linux kernels handle readahead, and uncovered this interesting and enlightening summary of a recent-ish discussion about how the kernel does it today, and how to maybe improve it. I had no idea that the kernel dynamically adjusts the readahead size depending on how the application's IO is behaving, cool.

mikemccand · 2024-04-01T16:34:41Z

The Linux source for readahead is quite wild (WARNING: GPL 2 code -- read at your own risk!): https://github.com/torvalds/linux/blob/master/mm/readahead.c

jpountz · 2024-04-04T09:56:36Z

Superseded by #13244.

jpountz requested a review from uschindler March 27, 2024 09:59

uschindler approved these changes Mar 27, 2024

tidy

3dde0d8

rmuir reviewed Mar 28, 2024

rmuir requested changes Mar 28, 2024

jpountz closed this Apr 4, 2024
