
Stop enforcing arbitrary high value for sysctl vm.max_map_count #92618

Open
uschindler opened this issue Jan 2, 2023 · 8 comments
Labels
>bug, :Core/Infra/Core (Core issues without another label), Team:Core/Infra (Meta label for core/infra team)

Comments

@uschindler
Contributor

uschindler commented Jan 2, 2023

Elasticsearch Version

any

Installed Plugins

n/a

Java Version

bundled

OS Version

Linux

Problem Description

When setting up a cluster, Elasticsearch requires setting vm.max_map_count to an arbitrarily high value. This consumes kernel resources and is completely unnecessary for normal installations with a few hundred shards per node.

We know that there is currently a limitation that forces Lucene to split maps into chunks of 1 GiB, but even with that limitation the Linux default of 65530 is more than enough for most installations. If somebody hits the limit, it is easy to change, but enforcing it that strictly and by default is... sorry: bullshit. I have to say it bluntly!

Instead, please enable Java 19 by default (already done) and use the JVM option --enable-preview (before Lucene 9.5, see #90526). With Lucene 9.5 we will use 16 GiB chunks by default and without preview mode, so enforcing that setting is pointless anyway.
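For illustration, here is a minimal arithmetic sketch of how the chunk size bounds the number of map regions a single file needs; the 40 GiB file is a made-up example, not something from this issue:

```java
public class MapRegionsPerFile {
    // Rough number of mmap regions one file needs when maps are split into
    // fixed-size chunks, mirroring how Lucene's MMapDirectory splits large files.
    static long regionsPerFile(long fileSizeBytes, long chunkSizeBytes) {
        return Math.max(1, (fileSizeBytes + chunkSizeBytes - 1) / chunkSizeBytes);
    }

    public static void main(String[] args) {
        long fileSize = 40L << 30; // a hypothetical 40 GiB .fdt file
        System.out.println(regionsPerFile(fileSize, 1L << 30));  // 40 regions with 1 GiB chunks
        System.out.println(regionsPerFile(fileSize, 16L << 30)); // 3 regions with 16 GiB chunks
    }
}
```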

If you still want to print a warning, I am fine with that, but not letting users go into production without this obviously wrong setting, which is enforced as a workaround for a rare problem, is a disaster!

Steps to Reproduce

Start Elasticsearch in cloud mode without setting vm.max_map_count in sysctl.

Logs (if relevant)

No response

@uschindler added the >bug and needs:triage (Requires assignment of a team area label) labels on Jan 2, 2023
@jpountz
Contributor

jpountz commented Jan 3, 2023

For context, this issue is forked from the Lucene user list where a user cannot set the vm.max_map_count because they are running on Azure Container Instances (ACI): https://lists.apache.org/thread/44fyr1kn73hfs3z4m03s9s8y1t9nz9nf.

@jpountz
Contributor

jpountz commented Jan 3, 2023

In my opinion, the two main factors that contribute to high numbers of map regions are:

  • We want to use separate indexes or data streams per dataset instead of mixing unrelated data together. But some datasets have little data and result in small segments.
  • Users want to fill large disks with data. If you have a 5TB disk and segments have 10 mmap files (doc, dvd, fdx, fdt, kdi, kdd, nvd, pos, tip, tim), having 65530 maps means your segments must be at least 5TB * 10 / 65530 ≈ 800MB on average (see the sketch below). Compound files and the hybrid directory that only uses mmap for a subset of the files make this number a bit lower in practice.
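To make the back-of-the-envelope number above concrete, a small sketch of the same computation, using the 5 TB and 10-files-per-segment figures from the bullet above:

```java
public class MinAverageSegmentSize {
    public static void main(String[] args) {
        long diskBytes = 5L << 40;    // ~5 TiB of data to fill on the node (the "5TB disk" above)
        int mmapFilesPerSegment = 10; // doc, dvd, fdx, fdt, kdi, kdd, nvd, pos, tip, tim
        int maxMapCount = 65530;      // Linux default for vm.max_map_count

        // With one map region per file, the node can hold at most
        // maxMapCount / mmapFilesPerSegment segments, so the average segment
        // must be at least this large for the data to fit:
        long minAvgSegmentBytes = diskBytes * mmapFilesPerSegment / maxMapCount;
        System.out.println((minAvgSegmentBytes >> 20) + " MiB"); // roughly 800 MiB
    }
}
```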

Having to change the vm.max_map_count up-front is annoying, but hitting an out-of-memory error at runtime is even worse? The bigger chunks with Panama are good, but I don't think we can say that it no longer makes sense to require bumping the number of allowed map regions thanks to this change alone. It still feels easy to hit the 65k limit with a single map region per file. Maybe there are other changes we could make that would reduce the pain around this, thinking out loud:

  • Maybe the required value for vm.max_map_count could be made dependent on the total amount of storage on the node. This way, it would only be required to bump vm.max_map_count on bigger nodes.
  • Maybe we should use compound files up to a fixed threshold (e.g. 1GB) instead of using them for segments that are at most 10% of the size of a shard. This would reduce the contribution of small shards to the number of map regions (see the MergePolicy sketch after this list).
  • Maybe HybridDirectory should not use mmap on tiny segments?
  • Maybe rollover should automatically trigger some merging in order to merge together the long tail of small segments?
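As a rough illustration of the fixed-threshold idea from the second bullet, this is roughly what it could look like at the Lucene MergePolicy level; it sketches the concept only and is not Elasticsearch's actual configuration code:

```java
import org.apache.lucene.index.TieredMergePolicy;

public class FixedCompoundFileThreshold {
    public static void main(String[] args) {
        TieredMergePolicy mergePolicy = new TieredMergePolicy();

        // Instead of the ratio-based default (compound files only for segments
        // that stay below a fraction of the index size), always allow compound
        // files by ratio but cap them at a fixed size, e.g. 1 GB:
        mergePolicy.setNoCFSRatio(1.0);
        mergePolicy.setMaxCFSSegmentSizeMB(1024.0);
    }
}
```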

@iverase added the :Core/Infra/Core (Core issues without another label) label and removed the needs:triage (Requires assignment of a team area label) label on Jan 3, 2023
@elasticsearchmachine added the Team:Core/Infra (Meta label for core/infra team) label on Jan 3, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@uschindler
Contributor Author

Hi Adrien,
thanks for the feedback! I agree that the larger Panama chunks won't help if you have many small files < 1 GiB. I agree with your proposals:

  • Make the check depend on the number of files (not index size). It is just hard to figure out at bootstrap time how many files a node will use. What happens if you add more indexes after startup? One solution would be to remove the check from bootstrapping and instead print warnings to the log file as soon as the problem gets close (similar to when the file system fills up). Then users have enough time to do a rolling restart after applying the setting. I just have a problem with a bootstrap check!
  • Compound files can be configured to only be used for small segments, and Lucene defaults to that. Maybe you changed it in Elasticsearch. Or alternatively use NIOFSIndexInput for them; the check could be added in the FileSwitchDirectory subclass (a rough sketch follows below).
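A rough sketch of the kind of file-switching setup discussed in the second bullet: route a few hand-picked extensions to mmap and serve everything else (including compound files) through NIOFSDirectory. The extension split and the path are purely illustrative, not what Elasticsearch's HybridDirectory actually does:

```java
import java.nio.file.Path;
import java.util.Set;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FileSwitchDirectory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.store.NIOFSDirectory;

public class SwitchingDirectoryExample {
    public static void main(String[] args) throws Exception {
        Path indexPath = Path.of("/tmp/example-index"); // hypothetical path

        // Only these extensions go through mmap (the primary directory);
        // everything else, including compound files, is read via NIOFSDirectory.
        Set<String> mmapExtensions = Set.of("tip", "tim", "doc", "pos", "dvd", "nvd", "kdd", "kdi");

        Directory dir = new FileSwitchDirectory(
                mmapExtensions,
                new MMapDirectory(indexPath),
                new NIOFSDirectory(indexPath),
                true /* close both wrapped directories when this one is closed */);

        System.out.println(dir);
        dir.close();
    }
}
```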

> Having to change the vm.max_map_count up-front is annoying, but hitting an out-of-memory error at runtime is even worse?

I do not agree with this. It takes quite a long time until this happens and most users won't be affected by it. I would just print the warning on startup, or maybe complain at runtime if the system figures out that you have many indexes with many files open. If somebody hits this, they get an IOException like "disk full" or similar. We no longer have the bad "OutOfMemoryError" mentioned above; that was fixed a long time ago, so the exception should not confuse anybody. The exception also gives a good message about what needs to be done and what caused it. Lucene's commits work well, so the node may just stop working (but this can also happen when the disk is full). Just stop the node, change the setting in your sysctl and start it again. It should replay the transactions and all is fine again. Stuff like this has happened to me very often, so I am glad that Elasticsearch handles those error conditions well.

In short: my problem is only that you can't start the node at all when you do not have the setting applied. My only wish is to instead print a warning, and maybe show that warning in Kibana too, but refusing to start is a no-go. The same applied in the past to settings like enforcing "swappiness=0", although everybody around me suggests using "swappiness=1" or disabling swap entirely. If users want to configure the cluster their own way, let them do it. Forcing users to do something just so they don't shoot themselves in the foot is fine to a certain degree, but there must always be a way around checks. I patched that startup error out of an older Elasticsearch cluster used for library search with only a few indexes, because I don't agree with applying those settings to my kernel, sorry.

@jpountz
Contributor

jpountz commented Jan 3, 2023

> you can't start the node at all when you do not have the setting applied

For reference, it is still possible to set node.store.allow_mmap: false to avoid having to set vm.max_map_count. This isn't fully satisfactory as it prevents using mmap at all and falls back to NIOFSDirectory, but at least it allows running Elasticsearch.

jpountz added a commit to jpountz/elasticsearch that referenced this issue Jan 4, 2023
Lucene's `TieredMergePolicy`'s default consists of creating compound files for
segments that do not exceed 10% of the total index size at the time of writing
these segments. This sort of default makes sense in the context of a single
Lucene index, but much less for a system like Elasticsearch where a node may
host heterogeneous shards: it doesn't make sense to decide to use a compound
file for a 1GB segment on one shard because it has 30GB of data, and not on
another shard because it has 5GB of data.

This change does the following:
 - `index.compound_format` now accepts byte size values, in addition to a
   `boolean` or ratio in [0,1].
 - The default value of `index.compound_format` changes from `0.1` to `1gb`.
   Said otherwise, segments will be stored in compound files if they don't
   exceed 1GB.

In practice, this means that smaller shards will use compound files much more
and contribute less to the total number of open files or map regions, while the
bigger shards may use compound files a bit less.

Relates elastic#92618
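To spell out the example from the commit message, a minimal sketch of the old ratio-based default versus the new fixed 1gb default (simplified: the real setting also accepts explicit boolean and ratio values):

```java
public class CompoundFormatDefaults {
    // Old default (0.1): compound file if the segment is at most 10% of the shard.
    static boolean oldDefault(long segmentBytes, long shardBytes) {
        return segmentBytes <= shardBytes / 10;
    }

    // New default (1gb): compound file if the segment does not exceed 1 GiB.
    static boolean newDefault(long segmentBytes) {
        return segmentBytes <= (1L << 30);
    }

    public static void main(String[] args) {
        long oneGiB = 1L << 30;
        // The 1 GiB segment from the commit message: the old rule decides
        // differently depending on shard size, the new rule does not.
        System.out.println(oldDefault(oneGiB, 30L << 30)); // true  (30 GiB shard)
        System.out.println(oldDefault(oneGiB, 5L << 30));  // false (5 GiB shard)
        System.out.println(newDefault(oneGiB));            // true
    }
}
```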
jpountz added a commit that referenced this issue Jan 4, 2023
@yanchanglj

uschindler's proposal is a good idea. Give users the right to choose instead of forcing them.

@justmike1

justmike1 commented Sep 20, 2023

I ended up going with what @jpountz described and it works (8.10.1). What are the downsides of that? From my own research I don't see anything major other than a different way of working, but why is one the default instead of the other?

Why isn't NIOFSDirectory the default?
