-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Recommend lowering the default mmap readahead. #13223
Conversation
This is a follow-up of a discussion on apache#13219. `mmap` has a higher readahead than regular `read()` operations by default, e.g. 128kB instead of 16kB on my Linux box. On indexes that exceed the size of the page cache, this may trigger performance issues due to page cache trashing and additional page cache contention. Rather than forcing `MMapDirectory` to use `MADV_RANDOM` on all files, it would make more sense to configure a lower `mmap` readahead at the system level, e.g. the same readahead value as `read()` operations use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you fix the html violations all fine.
@@ -38,6 +38,15 @@ | |||
* fragmented address space. If you get an {@link IOException} about mapping failed, it is | |||
* recommended to reduce the chunk size, until it works. | |||
* | |||
* <p><b>NOTE</b>: On some platforms like Linux, mmap comes with a higher readahead than regular | |||
* read() operations, e.g. 128kB for mmap reads and 16kB for regular reads. Such a high default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you point to place in the kernel where this is happening?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should ask users to modify these settings on their block devices, at least i'd like to see actual documentation on why this should be adjusted for lucene (also it will impact the entire system)
I am also a bit skeptical why you need to modify the block device. If this would be a file system setting I can imagine it's useful. @rmuir this came from investigation by Wikimedia on Elasticsearch elastic/elasticsearch#27748 |
my thoughts here are that issues can be addressed by providing correct advice to we should benchmark this stuff to get it right. |
For reference, this change is based on similar observations as made on https://biriukov.dev/docs/page-cache/3-page-cache-and-basic-file-operations. public static void main(String[] args) throws Exception {
try (FSDirectory dir = FSDirectory.open(Paths.get("/data/a")); // switch to NIOFSDirectory to test with read()
IndexInput in = dir.openInput("term-ids__47.tmp", IOContext.READ)) {
in.readInt();
}
} While 16kB has proved workable in practice, we've seen major performance issues with Elasticsearch, a 128kB readahead and indexes that exceed the size of the page cache. My first take was that 128kB feels huge for a default readahead, almost buggy, and it's not clear to me why it's so much higher than with read(). Since this is controversial, I'm ok with the alternative approach of using a MADV_RANDOM all the time for |
I opened #13244 to show what this could look like. |
I was trying to understand exactly how modern Linux kernels handle readahead, and uncovered this interesting and enlightening summary of a recent-ish discussion about how the kernel does it today, and how to maybe improve it. I had no idea that the kernel dynamically adjusts the readahead size depending on how the application's IO is behaving, cool. |
The Linux source for readahead is quite wild (WARNING: GPL 2 code -- read at your own risk!): https://github.com/torvalds/linux/blob/master/mm/readahead.c |
Superseded by #13244. |
This is a follow-up of a discussion on #13219.
mmap
has a higher readahead than regularread()
operations by default, e.g. 128kB instead of 16kB on my Linux box. On indexes that exceed the size of the page cache, this may trigger performance issues due to page cache trashing and additional page cache contention. Rather than forcingMMapDirectory
to useMADV_RANDOM
on all files, it would make more sense to configure a lowermmap
readahead at the system level, e.g. the same readahead value asread()
operations use.