Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recommend lowering the default mmap readahead. #13223

Closed

Conversation

jpountz
Copy link
Contributor

@jpountz jpountz commented Mar 27, 2024

This is a follow-up of a discussion on #13219. mmap has a higher readahead than regular read() operations by default, e.g. 128kB instead of 16kB on my Linux box. On indexes that exceed the size of the page cache, this may trigger performance issues due to page cache trashing and additional page cache contention. Rather than forcing MMapDirectory to use MADV_RANDOM on all files, it would make more sense to configure a lower mmap readahead at the system level, e.g. the same readahead value as read() operations use.

This is a follow-up of a discussion on apache#13219. `mmap` has a higher readahead
than regular `read()` operations by default, e.g. 128kB instead of 16kB on my
Linux box. On indexes that exceed the size of the page cache, this may trigger
performance issues due to page cache trashing and additional page cache
contention. Rather than forcing `MMapDirectory` to use `MADV_RANDOM` on all
files, it would make more sense to configure a lower `mmap` readahead at the
system level, e.g. the same readahead value as `read()` operations use.
@jpountz jpountz requested a review from uschindler March 27, 2024 09:59
Copy link
Contributor

@uschindler uschindler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you fix the html violations all fine.

@@ -38,6 +38,15 @@
* fragmented address space. If you get an {@link IOException} about mapping failed, it is
* recommended to reduce the chunk size, until it works.
*
* <p><b>NOTE</b>: On some platforms like Linux, mmap comes with a higher readahead than regular
* read() operations, e.g. 128kB for mmap reads and 16kB for regular reads. Such a high default
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you point to place in the kernel where this is happening?

Copy link
Member

@rmuir rmuir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should ask users to modify these settings on their block devices, at least i'd like to see actual documentation on why this should be adjusted for lucene (also it will impact the entire system)

@uschindler
Copy link
Contributor

I am also a bit skeptical why you need to modify the block device. If this would be a file system setting I can imagine it's useful.

@rmuir this came from investigation by Wikimedia on Elasticsearch elastic/elasticsearch#27748

@rmuir
Copy link
Member

rmuir commented Mar 28, 2024

my thoughts here are that issues can be addressed by providing correct advice to madvise. IMO this should typically be MADV_RANDOM because accesses are in random order: even if "we" think of it as sequential, we are sharing single filemap across multiple threads and they seek to some random place in file and read a little and are done, that's RANDOM!

we should benchmark this stuff to get it right.

@jpountz
Copy link
Contributor Author

jpountz commented Mar 28, 2024

For reference, this change is based on similar observations as made on https://biriukov.dev/docs/page-cache/3-page-cache-and-basic-file-operations. mmap comes with a 128kB readahead while read() only does 16kB readahead. I can reproduce the exact same numbers with code as below, plus dropping caches and using vmtouch.

  public static void main(String[] args) throws Exception {
    try (FSDirectory dir = FSDirectory.open(Paths.get("/data/a")); // switch to NIOFSDirectory to test with read()
        IndexInput in = dir.openInput("term-ids__47.tmp", IOContext.READ)) {
      in.readInt();
    }
  }

While 16kB has proved workable in practice, we've seen major performance issues with Elasticsearch, a 128kB readahead and indexes that exceed the size of the page cache. My first take was that 128kB feels huge for a default readahead, almost buggy, and it's not clear to me why it's so much higher than with read(). Since this is controversial, I'm ok with the alternative approach of using a MADV_RANDOM all the time for IOContext.READ. We should benchmark the impact of a smaller readahead to confirm it performs well, from my testing it only reads one page at a time in that case, but intuitively it should be ok.

@jpountz
Copy link
Contributor Author

jpountz commented Mar 29, 2024

the alternative approach of using a MADV_RANDOM all the time for IOContext.READ

I opened #13244 to show what this could look like.

@mikemccand
Copy link
Member

I was trying to understand exactly how modern Linux kernels handle readahead, and uncovered this interesting and enlightening summary of a recent-ish discussion about how the kernel does it today, and how to maybe improve it. I had no idea that the kernel dynamically adjusts the readahead size depending on how the application's IO is behaving, cool.

@mikemccand
Copy link
Member

The Linux source for readahead is quite wild (WARNING: GPL 2 code -- read at your own risk!): https://github.com/torvalds/linux/blob/master/mm/readahead.c

@jpountz
Copy link
Contributor Author

jpountz commented Apr 4, 2024

Superseded by #13244.

@jpountz jpountz closed this Apr 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants