Avoid file cache trashing on Linux with mmapfs by using madvise() ? #27748

Closed
micoq opened this issue Dec 10, 2017 · 27 comments
Labels
:Core/Infra/Core Core issues without another label Team:Core/Infra Meta label for core/infra team team-discuss


@micoq

micoq commented Dec 10, 2017

With mmapfs, search queries load more data than necessary into the page cache. By default, every memory mapping created with mmap() (and FileChannel.map() in Java) on Linux is expected to be read (almost) sequentially. However, when a search request runs, the inverted index and other data structures seem to be read randomly, so the system loads extra memory pages before and after the page that is actually needed.
This results in a lot of I/O from the storage to warm up the file cache. In addition, the cache fills with unnecessary data, which can evict hot pages sooner and slow down subsequent requests.
The problem is more visible with big indices (~1TB in our case).

To avoid this, Linux provides the madvise() syscall to change the prefetching behavior of memory maps. Calling it with the MADV_RANDOM flag tells the system not to prefetch extra pages.
Unfortunately, Java doesn't use this syscall. Lucene provides a native library to do this, org.apache.lucene.store.NativePosixUtil, but it doesn't seem to be used.

To illustrate this, I ran some tests on read-only indices (~60GB) with a batch of search requests (bool requests on some fields with size=0, just document counts). Each index has been optimized with _forcemerge:

                            Warm                  Cold
                       madv   mmap    nio   madv   mmap   nio

query  1               8276   9100   9422  13967  10487 10769
query  2                  9     10      9     95   1031    28
query  3                403    774    753   1019   1267   839
query  4                428    852    739    702   1025   857
query  5               4003   5591   5580   7970   6778  5947
query  6               1608   2237   2567   2611   2511  2594
query  7               5154   7193   7476   7890   7204  7943
query  8                438    705    707   1110   1211   793
query  9               2824   3922   4377   4143   4400  4237
query 10               2313   3235   3073   3086   3262  3471
average                2545   3361   3470   4259   3917  3747

consumed cache (Mio)      -      -      -   1607   7659  4687
storage I/O (Mio/s)       0      0      0    ~30   ~250  ~150

Each column represents a single test and results are in ms:

  • "cold" is made after a fresh startup and empty caches (echo 3 > /proc/sys/vm/drop_caches)
  • "warm" is the same test made right after the first one

The query cache and the request cache have been disabled.
The storage is made of spinning disks.
Elasticsearch version: 5.5.0.
OS: RHEL 7.3

You can see mmapfs is consuming more cache and IO than niofs.
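
As a sanity check on the table, here is a minimal sketch that recomputes the "average" row for the warm runs from the per-query timings. The class and method names are made up for illustration, and the truncating integer division is an assumption about how the averages were rounded (it does reproduce the values in the table).

```java
import java.util.Arrays;

final class QueryAverages {
    // Per-query timings in ms for the warm runs, copied from the table (queries 1 to 10).
    static final int[] WARM_MADV = {8276, 9, 403, 428, 4003, 1608, 5154, 438, 2824, 2313};
    static final int[] WARM_MMAP = {9100, 10, 774, 852, 5591, 2237, 7193, 705, 3922, 3235};
    static final int[] WARM_NIO  = {9422, 9, 753, 739, 5580, 2567, 7476, 707, 4377, 3073};

    /** Truncated integer average, assumed to match the rounding used in the table. */
    static int average(int[] timingsMs) {
        return Arrays.stream(timingsMs).sum() / timingsMs.length;
    }

    /** Time saved by a relative to b, in percent. */
    static double savingPercent(int a, int b) {
        return 100.0 * (b - a) / b;
    }

    public static void main(String[] args) {
        int madv = average(WARM_MADV);
        int mmap = average(WARM_MMAP);
        int nio = average(WARM_NIO);
        System.out.printf("warm averages (ms): madv=%d mmap=%d nio=%d%n", madv, mmap, nio);
        System.out.printf("madv takes %.1f%% less time than mmap on warm data%n",
                savingPercent(madv, mmap));
    }
}
```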

In the madv column, I patched Lucene (MMapDirectory) to execute madvise(MADV_RANDOM) on each mapped file. This further reduces file cache and I/O consumption. In addition, searches are faster on warmed data.
To do this, I just added a single line in MMapDirectory.java:

final ByteBuffer[] map(String resourceDescription, FileChannel fc, long offset, long length) throws IOException {
...
  try {
    buffer = fc.map(MapMode.READ_ONLY, offset + bufferStart, bufSize);
    NativePosixUtil.madvise(buffer,NativePosixUtil.RANDOM); // here !
  } catch (IOException ioe) {
    throw convertMapFailedIOException(ioe, resourceDescription, bufSize);
  }
...
  return buffers;
}

Then I compiled the shared native library libNativePosixUtil.so (with the Lucene sources):

cd lucene-6.6.0/misc
ant build-native-unix

Finally, I started Elasticsearch with -Djava.library.path=/.../lucene-6.6.0/build/native/NativePosixUtil.so in jvm.options.

I don't know whether this solution applies in all cases, and I didn't test everything (replication, merging, other queries...), but it could explain why mmapfs performs badly on large setups for searching. Some users reported similar behavior, like here, here or here.

I don't know whether there is a similar problem on Windows, since its memory management is different.

@jasontedor
Member

I think this should be opened as a Lucene issue?

@s1monw
Contributor

s1monw commented Dec 15, 2017

Lucene already has a directory for this, but it requires native code etc. You can go and use it; adding a custom directory to ES with a plugin is pretty straightforward. Since Java 10 will add the ability to add O_DIRECT to streams / channels, I think I'd want to wait for this and add it as an optional thing we can use if you run on Java 10 (which will come early next year, I assume).

I hope this makes sense @micoq. I will close this for now. We can still reopen if needed. Please feel free to continue the discussion here.

@s1monw s1monw closed this as completed Dec 15, 2017
@s1monw s1monw removed the discuss label Dec 15, 2017
@s1monw s1monw self-assigned this Dec 15, 2017
@micoq
Author

micoq commented Dec 15, 2017

Actually, I haven't tested NativeUnixDirectory yet.
O_DIRECT could help to reduce the cache thrashing while merging/writing segments (and maybe for the translog or shard restoration?). It's a good idea to integrate this directly into the JVM.

In my case, the bottleneck was mainly the read operations while searching documents. I wanted to know why mmapfs performed badly on large setups, contrary to niofs (since mmapfs is theoretically better because it avoids an extra copy into a user space buffer).

@s1monw
Contributor

s1monw commented Dec 18, 2017

In my case, the bottleneck was mainly the read operations while searching documents. I wanted to know why mmapfs performed badly on large setups, contrary to niofs (since mmapfs is theoretically better because it avoids an extra copy into a user space buffer).

That is truly interesting. It almost seems that with madvise the OS is a bit more diligent about mapping memory into the cache. This is all pure guessing, but in default mode mmap will initiate quite a bit of read-ahead, while in MADV_RANDOM mode it won't do any. In the NIO case, I guess the read-ahead overhead amortizes the extra copy to a userspace buffer that the OS needs to do, which is actually surprising since normal file I/O has quite a bit of overhead compared to mmap. The latter involves (ideally) no syscalls at all when reading, and seeks are basically pointer manipulations. I wonder whether, when indices / files are very large, reads through NIO are faster due to better usage of FS caches and amortized, or rather prevented, read-aheads.

This is quite an interesting place to do some research. I am convinced we won't ship any native code in core by default but there might be room for a plugin here.

@s1monw
Contributor

s1monw commented Dec 18, 2017

@micoq can you tell how much memory your machine has and how much of it you are giving to elasticsearch when you run these tests?

@micoq
Author

micoq commented Dec 18, 2017

@s1monw Sure.
This machine has 64 GB of memory (swap is disabled).

Elasticsearch is configured with a heap of 24GB:

jvm.options
-Xms24576m
-Xmx24576m

Compressed pointers are enabled:
[2017-12-18T18:05:47,129][INFO ][o.e.e.NodeEnvironment ] [cLvT3QE] heap size [24gb], compressed ordinary object pointers [true]

@micoq
Author

micoq commented Dec 18, 2017

I didn't do the test with NIO, but if you map a big file and touch a single byte every 2 MB, Linux will load the entire 2 MB chunk from the disk. Here is a quick and dirty example to illustrate this (drop the caches before executing it):

public static void main(String[] args) throws IOException {
    MMapDirectory dir = new MMapDirectory(Paths.get("/home/me/test"));
    dir.setPreload(false); // just to be sure
    IOContext ctx = new IOContext();
    IndexInput in = dir.openInput("bigfile.dat", ctx);
    long pos = 0;
    while (true) {
        in.seek(pos);
        // Uncomment exactly one stride. The rates are the theoretical read
        // throughput with a 1 ms delay between two accesses.
        //pos += 2 << 11; // 4 MB/s (a byte every single page or 4096 bytes)
        //pos += 2 << 12; // 8 MB/s (a byte every 2 pages or 8192 bytes)
        //pos += 2 << 13; // 16 MB/s
        //pos += 2 << 14; // 32 MB/s
        //pos += 2 << 15; // 64 MB/s
        //pos += 2 << 16; // 128 MB/s
        //pos += 2 << 17; // 256 MB/s
        //pos += 2 << 18; // 512 MB/s
        //pos += 2 << 19; // 1024 MB/s
        pos += 2 << 20;   // 2048 MB/s
        //pos += 2 << 21; // 4096 MB/s
        //pos += 2 << 22; // 4096 MB/s (the speed doesn't increase after this)
        try {
            Thread.sleep(1); // set this to 10 or 100 if the storage doesn't keep up
        } catch (InterruptedException e) {}
        in.readByte();
    }
}
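
For readers without Lucene on the classpath, the same experiment can be approximated with the JDK alone. This is only a sketch under assumptions: the class and helper names are hypothetical, it maps the file with FileChannel.map() (the call mentioned in the first post) instead of going through MMapDirectory, and it is limited to files below 2 GB because it uses a single MappedByteBuffer.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

final class StrideTouch {
    /** Maps the file read-only and reads one byte every stride bytes; returns the number of reads. */
    static int touch(Path file, int stride) {
        try (FileChannel fc = FileChannel.open(file, StandardOpenOption.READ)) {
            // A single MappedByteBuffer cannot exceed Integer.MAX_VALUE bytes.
            long size = Math.min(fc.size(), Integer.MAX_VALUE);
            MappedByteBuffer map = fc.map(FileChannel.MapMode.READ_ONLY, 0, size);
            int reads = 0;
            for (long pos = 0; pos < size; pos += stride) {
                map.get((int) pos); // the first access to an uncached page triggers a page fault
                reads++;
            }
            return reads;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates a zero-filled temporary file of the given size (hypothetical helper for the demo). */
    static Path tempFile(int size) {
        try {
            Path p = Files.createTempFile("stride", ".dat");
            Files.write(p, new byte[size]);
            p.toFile().deleteOnExit();
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Usage: java StrideTouch.java [file] [stride]; defaults to a small temporary file.
        Path file = args.length > 0 ? Paths.get(args[0]) : tempFile(64 * 4096);
        int stride = args.length > 1 ? Integer.parseInt(args[1]) : 4096;
        System.out.println("bytes touched: " + touch(file, stride));
    }
}
```

To observe the readahead effect, run it against a large file right after dropping the caches and compare the number of bytes touched with the amount of data that ends up in the page cache.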

Note that on OpenJDK 8 and Linux, if you preload the file with dir.setPreload(true) (the load() method from MappedByteBuffer: https://docs.oracle.com/javase/8/docs/api/java/nio/MappedByteBuffer.html), the JVM will execute madvise(MADV_WILLNEED) on the whole mapping (it's the only case where Java seems to use madvise()). On Windows, the JVM just touches all the pages by reading a single value on every page (like the example above with the first pos += line uncommented). See MappedByteBuffer.c in the JDK source.

madvise(MADV_RANDOM) effectively disables readaheads. Unfortunately, I didn't find any other way (i.e. without a native library) to change the kernel behavior.

@childe

childe commented Jun 7, 2018

@micoq maybe you can use blockdev --setra to change the kernel behavior.

@micoq
Author

micoq commented Jun 7, 2018

@childe Thank you but unfortunately this doesn't have any effect (I tried different values, the default was 8192 blocks in my case). The bottleneck came from the memory management read ahead which is different from the block device read ahead.

The memory management read ahead is only used for mapped files and the block device read ahead for all read operations on storage devices whether using niofs or mmapfs.

In fact, the 2MB read ahead limit I observed in my tests seems to be hard coded in the kernel:
https://github.com/torvalds/linux/blob/master/mm/readahead.c#L236

@pmoust pmoust reopened this Nov 7, 2018
@colings86 colings86 added the :Core/Infra/Core Core issues without another label label Nov 9, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@micoq
Author

micoq commented Nov 12, 2018

Hello,

I've just uploaded a plugin for Elasticsearch which implements memory mapping with madvise() and direct I/O for merges. It's available here: https://github.com/micoq/native-unix-store

@azuresky11

Hello @micoq,
I have a question about the native-unix-store plugin. I ran two tests under identical conditions (same environment and parameters), both with insufficient memory.
Why does it take more time after installing the native-unix-store plugin? Thanks.

@micoq
Author

micoq commented Mar 17, 2019

Hello @azuresky11,

Can you provide some details about the configuration, the queries and the dataset you used to run these tests?

  • OS, Elasticsearch, Java and plugin versions
  • CPU / RAM / Storage
  • queries involved in your tests (read only, writes, only indexing...?)
  • index settings (did you use settings other than "index.store.type":"nativeunixfs" in your second test?)
  • dataset (size, number of shards)

Are the results the same if you enable the plugin but disable mmap in the test index with "index.store.mmap.enabled": false?

Finally, it's possible that your queries read data more sequentially than randomly, in which case the madvise(RANDOM) optimization will not help. You could try with "index.store.mmap.read_ahead": true

@azuresky11

azuresky11 commented Mar 19, 2019

@micoq
This is my configuration:
1. CentOS Linux release 7.4.1708 (Core), elasticsearch-6.6.0, openjdk11, plugin version: native-unix-store-6.6.0-1.0.0
2. 40U 20core, virtual machine created on the server. The memory size is 16 GB; storage is a 4 TB SATA disk.
3. The test case is a segment merge (but I find that only one core is used although I set 40U). This is my command: curl -XPOST http://162.19.33.112:9206/es*/_forcemerge?max_num_segments=1
4. I tried this setting: "index.store.type":"mmapfs". It takes less time.
5. Dataset: 80 GB, 5 shards, 0 replicas
I will try your suggestion. Thanks for your reply.

@micoq
Author

micoq commented Mar 19, 2019

@azuresky11

That's interesting:
By default, nativeunixfs uses the same I/O method as niofs for reading segments in a merge context (if you don't use directIO), whereas mmapfs uses the memory-mapped file (both use the same method for writing).
nativeunixfs only uses the memory-mapped file for search queries. This could explain the difference.

So niofs would be less efficient than mmapfs for merges. In this case, you should get the same result with niofs or nativeunixfs on your merge test.

You could try directIO for merges, but it's not necessarily better in terms of speed (it only saves the filesystem cache).

If I'm not mistaken, the _forcemerge operation always uses a single thread/core.

@azuresky11

@micoq
Which performs better after installing the plugin, niofs or mmapfs (for writing and aggregation)?

@micoq
Author

micoq commented Mar 21, 2019

@azuresky11
The plugin doesn't change the behavior of niofs ("index.store.type":"niofs" in index settings) or mmapfs ("index.store.type":"mmapfs" in index settings).
It is only used if "index.store.type":"nativeunixfs" is set in the index settings.

Anyway, by default mmapfs is theoretically better than niofs on access time, since it avoids an extra copy of data between kernel space and user space (to fill a buffer).
However, mmapfs tends to load more data than necessary and can perform worse than niofs. That's why I wrote this plugin.
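
To make the "extra copy" concrete, here is a minimal JDK-only sketch of the two read paths. It is not how Lucene's NIOFSDirectory or MMapDirectory are implemented; the class and method names are hypothetical and only show the underlying JDK calls: a positional FileChannel.read() that fills a user-space buffer, and a read through a mapping created with FileChannel.map().

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ReadPaths {
    /** niofs-style read: a read call copies the data from the page cache into our own buffer. */
    static byte readWithNio(Path file, long pos) {
        try (FileChannel fc = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(1); // user-space buffer filled by the copy
            if (fc.read(buf, pos) != 1) {
                throw new IOException("short read at " + pos);
            }
            return buf.get(0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** mmapfs-style read: the mapped pages are accessed directly, without an intermediate buffer. */
    static byte readWithMmap(Path file, long pos) {
        try (FileChannel fc = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer map = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
            return map.get((int) pos); // a plain memory access; may page-fault if the page is not cached
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates a 4096-byte temporary file where byte i holds the value (byte) i. */
    static Path sampleFile() {
        try {
            byte[] data = new byte[4096];
            for (int i = 0; i < data.length; i++) {
                data[i] = (byte) i;
            }
            Path p = Files.createTempFile("readpaths", ".dat");
            Files.write(p, data);
            p.toFile().deleteOnExit();
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = sampleFile();
        System.out.println("nio:  " + readWithNio(file, 300));
        System.out.println("mmap: " + readWithMmap(file, 300));
    }
}
```

Both paths return the same byte; the difference lies in how the data reaches the JVM, and in how the kernel decides what to prefetch.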

@azuresky11

Hello @micoq
I used an SSD to run several query tests on a 1 TB dataset. With "index.store.type": "mmapfs" and with "index.store.type": "nativeunixfs", the query times are almost the same. Do you know why?

@micoq
Author

micoq commented Apr 1, 2019

Hello @azuresky11,
The benefits of the plugin depend heavily on the data structure of your index and your filesystem cache usage. It is difficult to predict the behavior without performing some tests in real conditions.

Basically, if your queries do a lot of random accesses on cold data, "nativeunixfs" will load far less data from the storage: you will see a lower throughput (with iotop) and the filesystem cache will keep old data for a longer time.

As you can see in my first post, on a single test the average time is not necessarily better than with "mmapfs". However, "nativeunixfs" loads ~1.5 GB of data versus ~7.4 GB with "mmapfs" for the same results, so 5.9 GB of cache is wasted by the requests.
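
The figures above follow from the "consumed cache (Mio)" row of the table in the first post. A small sketch of the conversion, assuming 1 Gio = 1024 Mio (the class and method names are made up for illustration):

```java
final class CacheSavings {
    // "consumed cache (Mio)" for the cold runs, copied from the table in the first post.
    static final long MADV_MIO = 1607;
    static final long MMAP_MIO = 7659;

    /** Converts Mio to Gio, assuming 1 Gio = 1024 Mio. */
    static double mioToGio(long mio) {
        return mio / 1024.0;
    }

    public static void main(String[] args) {
        System.out.printf("madvise: %.2f Gio%n", mioToGio(MADV_MIO));
        System.out.printf("mmapfs:  %.2f Gio%n", mioToGio(MMAP_MIO));
        System.out.printf("extra cache used by mmapfs: %.2f Gio%n", mioToGio(MMAP_MIO - MADV_MIO));
    }
}
```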

In some cases, the saved filesystem cache can be used to serve other requests and finally improve the global response time (in my case, I use Elasticsearch for logging with constant indexing and many parallel requests).

In your case, maybe "mmapfs" will be sufficient (and you will not have to manage the native libraries). If you see no performance drops, you can keep it.

You can also try the new "hybridfs" store included in the latest version of Elasticsearch (6.7.0):
https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-store.html
This store combines "niofs" and "mmapfs".

Just a little (m)advice for your tests: between each test, don't forget to drop the caches with
echo 3 > /proc/sys/vm/drop_caches
after closing the indices and before reopening them with the new store type, or the data will not be dropped from the cache.
(running this command while a mapped file is loaded in the cache has no effect!)

@azuresky11

Thank you @micoq
I ran a write test with esrally. My test environment: ARM architecture, 4 CPUs, 16 GB of memory, -Xms8g, -Xmx8g. The dataset I wrote is 3 GB, and I restarted the OS between each test. Results:

                              mmapfs     niofs   nativeunixfs
Median Throughput (docs/s)  33449.35   33186.3        31830.4

I ran many tests and the results were the same. According to these results, the plugin does not improve performance. What is wrong here?

@micoq
Author

micoq commented Apr 19, 2019

@s1monw OK, I found out why mmapfs loads too much data: it was... the readahead of the filesystem! (@childe, you were right!)

However, it is not that simple!

To begin, the 2MB size hardcoded in the kernel is not the readahead size but the maximum amount of data the kernel can load at once from the storage while performing a readahead operation:
https://github.com/torvalds/linux/blob/master/mm/readahead.c#L236
If necessary, the kernel can load more than one 2MB chunk.

Unfortunately, the readahead size of the Elasticsearch data partition was also 2MB on all my testing machines...

In addition, using blockdev --setra on the whole disk is not sufficient. You must set it at the partition level for it to be taken into account (e.g. on an LVM logical volume).
For some filesystems, you cannot use blockdev and have to modify it directly in /sys. For example, with btrfs, you can modify the readahead here: /sys/devices/virtual/bdi/btrfs-1/read_ahead_kb
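
One thing to keep in mind when comparing the two interfaces is the unit: blockdev --setra and --getra count 512-byte sectors, while the read_ahead_kb files under /sys are expressed in KB. A small sketch of the conversion follows; the class and method names are made up, and the optional argument is whatever read_ahead_kb file applies to your device.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

final class ReadaheadUnits {
    static final int SECTOR_BYTES = 512; // unit used by blockdev --setra / --getra

    /** Converts a blockdev readahead value (512-byte sectors) to KB, as used by read_ahead_kb. */
    static long sectorsToKb(long sectors) {
        return sectors * SECTOR_BYTES / 1024;
    }

    /** Converts a read_ahead_kb value to the number of sectors expected by blockdev --setra. */
    static long kbToSectors(long kb) {
        return kb * 1024 / SECTOR_BYTES;
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            // Optional: path to a read_ahead_kb file, e.g. the btrfs one mentioned above.
            long kb = Long.parseLong(new String(Files.readAllBytes(Paths.get(args[0]))).trim());
            System.out.printf("%d KB = %d sectors%n", kb, kbToSectors(kb));
        }
        for (long sectors : new long[] {256, 4096, 8192}) {
            System.out.printf("%d sectors = %d KB%n", sectors, sectorsToKb(sectors));
        }
    }
}
```

With these units, the 8192 blocks mentioned earlier in the thread correspond to 4MB, and the usual 128KB default corresponds to 256 sectors.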

Some systems have a small readahead by default (usually 128KB), but large installations with hardware RAID can have a large one (4MB).

In Lucene, some files are read with many "jumps" or "holes" like the .doc files (links between terms and documents containing the terms). With a large readahead, these holes are filled with data even if we don't need it.

A more graphical example:

The file before any request:  -------------------------------------------------------
The data actually accessed:   RR---------------------------R--R-RR-R---R-----R--R--RR
The data loaded in the cache: RRRR-------------------------RRRRRRRRRRRRRRRR--RRRRRRRR

This happens on a .doc file with a query on a message field containing a lot of terms. It can be reproduced on public data with a query like this: field:*s*
(I know the leading wildcard is not optimal, but it makes the behavior easier to reproduce!)
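
The hole-filling effect in the diagram can be imitated with a toy model. This is explicitly not the kernel's readahead algorithm, only an illustration under simplifying assumptions: pages are visited in ascending order, and every miss on page p loads a fixed window of pages starting at p. All names are hypothetical.

```java
import java.util.Arrays;

final class ReadaheadToy {
    /**
     * Toy model: a miss on page p loads pages p .. p+window-1 (clipped to the file size).
     * 'R' marks an accessed page in the input and a cached page in the output.
     */
    static String simulate(String accessed, int window) {
        char[] cache = new char[accessed.length()];
        Arrays.fill(cache, '-');
        for (int p = 0; p < accessed.length(); p++) {
            if (accessed.charAt(p) == 'R' && cache[p] != 'R') {
                int end = Math.min(accessed.length(), p + window);
                for (int q = p; q < end; q++) {
                    cache[q] = 'R'; // page brought in by the readahead window
                }
            }
        }
        return new String(cache);
    }

    /** Counts the pages marked 'R'. */
    static long count(String pages) {
        return pages.chars().filter(c -> c == 'R').count();
    }

    public static void main(String[] args) {
        String accessed = "RR--------R--R-RR-R---R-----R--R--RR";
        for (int window : new int[] {1, 4, 8}) {
            String loaded = simulate(accessed, window);
            System.out.printf("window=%d  %s  (%d pages loaded for %d accessed)%n",
                    window, loaded, count(loaded), count(accessed));
        }
    }
}
```

With a window of 1, which is what madvise(MADV_RANDOM) amounts to in this model, the cached pages are exactly the accessed ones; larger windows fill the gaps between nearby accesses.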

For some reason, the readahead is always maximal on mapped memory (mmapfs) but not with standard I/O accesses (niofs). This can explain the poor performance of mmapfs on some deployments.
Here are some other people who had a similar problem: https://phabricator.wikimedia.org/T169498
and found similar solutions:

  • calling madvise(MADV_RANDOM) on the mapped files (with ptrace !)
  • changing the readahead of the data partition (easier !)
    madvise() always reduces the readahead to 0: only one 4KB page is read at a time.

Now, this will not eliminate the cache consumption while merging and it's not always the best choice to use a small readahead (or madvise()) but it could help to improve the performance on large clusters.

@jasontedor
Member

@micoq We analyzed the same problem with mmapfs around six months ago and came to the same conclusions. This is why we have introduced hybridfs (#36668), which uses NIO when we expect the access pattern to be random, such that sequential read-ahead would be painful, and otherwise uses mmap. We are also planning to make a contribution to the JVM to expose madvise(MADV_RANDOM) so that we can then return to using mmap everywhere for. For filesystem cache consumption while merging, now that the master branch of Lucene exposes JDK 10, I am planning to investigate using O_DIRECT when merging.

@rjernst rjernst added the Team:Core/Infra Meta label for core/infra team label May 4, 2020
@rjernst rjernst added the needs:triage Requires assignment of a team area label label Dec 3, 2020
@malpani
Contributor

malpani commented Dec 5, 2020

With https://issues.apache.org/jira/browse/LUCENE-8982 introducing a pure java based DirectIODirectory, what are your thoughts on adding directfs as one of the store types in Elasticsearch?

@vsop-479
Contributor

vsop-479 commented Feb 2, 2024

We are also planning to make a contribution to the JVM to expose madvise(MADV_RANDOM) so that we can then return to using mmap everywhere for.

Is there any progress on this plan?

@uschindler
Contributor

See apache/lucene#13196, which is going into Lucene 9.11. It is used with Java 21 or later.

@uschindler
Contributor

uschindler commented Mar 26, 2024

We are also planning to make a contribution to the JVM to expose madvise(MADV_RANDOM) so that we can then return to using mmap everywhere for.

Is there any progress on this plan?

With recent Java 21/22 changes around Project Panama, this is no longer needed, as you can pass a MemorySegment (used by the new version of MMapDirectory) directly to native code using a MethodHandle. See the PR above.

The bigger problem is fadvise(), which would reduce the impact on merging or when using NIOFSDir. fadvise() needs a file descriptor, so native support in the JDK is a requirement. There's already a discussion going on in the OpenJDK bug tracker: https://bugs.openjdk.org/browse/JDK-8292771

@jpountz
Contributor

jpountz commented Mar 26, 2024

Thanks @uschindler for closing the loop. I'm closing this issue in favor of the Lucene and JDK issues that you shared above.
