
Stop enforcing arbitrary high value for sysctl vm.max_map_count #92618

Open
uschindler opened this issue Jan 2, 2023 · 8 comments
Labels
>bug, :Core/Infra/Core (Core issues without another label), Team:Core/Infra (Meta label for core/infra team)

Comments

@uschindler
Contributor

uschindler commented Jan 2, 2023

Elasticsearch Version

any

Installed Plugins

n/a

Java Version

bundled

OS Version

Linux

Problem Description

When setting up a cluster, Elasticsearch requires setting vm.max_map_count to an arbitrarily high value. This consumes kernel resources and is completely unnecessary for normal installations with a few hundred shards per node.

We know that there is currently a limitation that forces Lucene to split maps into chunks of 1 GiB, but even with that limitation the Linux default of 65530 is more than enough for most installations. If somebody hits the limit, it is easy to change, but enforcing it that strictly and by default is... sorry: bullshit. I have to say it bluntly!

Instead, please enable Java 19 by default (already done) and use the JVM option --enable-preview (before Lucene 9.5, see #90526). With Lucene 9.5 we will use 16 GiB chunks by default and without preview mode, so enforcing that setting is pointless anyway.
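For illustration, here is a minimal arithmetic sketch of how the chunk size bounds the number of map regions a single file needs; the 40 GiB file is a made-up example, not something from this issue:

```java
public class MapRegionsPerFile {
    // Rough number of mmap regions one file needs when maps are split into
    // fixed-size chunks, mirroring how Lucene's MMapDirectory splits large files.
    static long regionsPerFile(long fileSizeBytes, long chunkSizeBytes) {
        return Math.max(1, (fileSizeBytes + chunkSizeBytes - 1) / chunkSizeBytes);
    }

    public static void main(String[] args) {
        long fileSize = 40L << 30; // a hypothetical 40 GiB .fdt file
        System.out.println(regionsPerFile(fileSize, 1L << 30));  // 40 regions with 1 GiB chunks
        System.out.println(regionsPerFile(fileSize, 16L << 30)); // 3 regions with 16 GiB chunks
    }
}
```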

If you still want to print a warning, I am fine with that, but not letting users go into production without this obviously wrong setting, which is enforced as a workaround for a rare problem, is a disaster!

Steps to Reproduce

Start Elasticsearch in cloud mode without setting vm.max_map_count in sysctl.

Logs (if relevant)

No response

@uschindler added the >bug and needs:triage (Requires assignment of a team area label) labels on Jan 2, 2023
@jpountz
Contributor

jpountz commented Jan 3, 2023

For context, this issue is forked from the Lucene user list where a user cannot set the vm.max_map_count because they are running on Azure Container Instances (ACI): https://lists.apache.org/thread/44fyr1kn73hfs3z4m03s9s8y1t9nz9nf.

@jpountz
Contributor

jpountz commented Jan 3, 2023

In my opinion, the two main factors that contribute to high numbers of map regions are:

  • We want to use separate indexes or data streams per dataset instead of mixing unrelated data together. But some datasets have little data and result in small segments.
  • Users want to fill large disks with data. If you have a 5TB disk and segments have 10 mmap files (doc, dvd, fdx, fdt, kdi, kdd, nvd, pos, tip, tim), having 65530 maps means your segments must be at least 5TB * 10 / 65530 ≈ 800MB on average (see the sketch below). Compound files and the hybrid directory that only uses mmap for a subset of the files make this number a bit lower in practice.
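To make the back-of-the-envelope number above concrete, a small sketch of the same computation, using the 5 TB and 10-files-per-segment figures from the bullet above:

```java
public class MinAverageSegmentSize {
    public static void main(String[] args) {
        long diskBytes = 5L << 40;    // ~5 TiB of data to fill on the node (the "5TB disk" above)
        int mmapFilesPerSegment = 10; // doc, dvd, fdx, fdt, kdi, kdd, nvd, pos, tip, tim
        int maxMapCount = 65530;      // Linux default for vm.max_map_count

        // With one map region per file, the node can hold at most
        // maxMapCount / mmapFilesPerSegment segments, so the average segment
        // must be at least this large for the data to fit:
        long minAvgSegmentBytes = diskBytes * mmapFilesPerSegment / maxMapCount;
        System.out.println((minAvgSegmentBytes >> 20) + " MiB"); // roughly 800 MiB
    }
}
```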

Having to change the vm.max_map_count up-front is annoying, but hitting an out-of-memory error at runtime is even worse? The bigger chunks with Panama are good, but I don't think we can say that it no longer makes sense to require bumping the number of allowed map regions thanks to this change alone. It still feels easy to hit the 65k limit with a single map region per file. Maybe there are other changes we could make that would reduce the pain around this, thinking out loud:

  • Maybe the required value for vm.max_map_count could be made dependent on the total amount of storage on the node. This way, it would only be required to bump vm.max_map_count on bigger nodes.
  • Maybe we should use compound files up to a fixed threshold (e.g. 1GB) instead of using them for segments that are at most 10% of the size of a shard. This would reduce the contribution of small shards to the number of map regions (see the MergePolicy sketch after this list).
  • Maybe HybridDirectory should not use mmap on tiny segments?
  • Maybe rollover should automatically trigger some merging in order to merge together the long tail of small segments?
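As a rough illustration of the fixed-threshold idea from the second bullet, this is roughly what it could look like at the Lucene MergePolicy level; it sketches the concept only and is not Elasticsearch's actual configuration code:

```java
import org.apache.lucene.index.TieredMergePolicy;

public class FixedCompoundFileThreshold {
    public static void main(String[] args) {
        TieredMergePolicy mergePolicy = new TieredMergePolicy();

        // Instead of the ratio-based default (compound files only for segments
        // that stay below a fraction of the index size), always allow compound
        // files by ratio but cap them at a fixed size, e.g. 1 GB:
        mergePolicy.setNoCFSRatio(1.0);
        mergePolicy.setMaxCFSSegmentSizeMB(1024.0);
    }
}
```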

@iverase added the :Core/Infra/Core (Core issues without another label) label and removed the needs:triage (Requires assignment of a team area label) label on Jan 3, 2023
@elasticsearchmachine added the Team:Core/Infra (Meta label for core/infra team) label on Jan 3, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@uschindler
Contributor Author

Hi Adrien,
thanks for the feedback! I agree that the larger Panama chunks won't help if you have many small files < 1 GiB. I agree with your proposals:

  • Make the check depend on the number of files (not index size). It is just hard to figure out at bootstrap time how many files a node will use. What happens if you add more indexes after startup? One solution would be to remove the check from bootstrapping and instead print warnings to the log file as soon as the problem gets close (similar to when the file system fills up). Then users have enough time to do a rolling restart after applying the setting. I just have a problem with a bootstrap check!
  • Compound files can be configured to only be used for small segments, and Lucene defaults to that. Maybe you changed it in Elasticsearch. Or alternatively use NIOFSIndexInput for them; the check could be added in the FileSwitchDirectory subclass (a rough sketch follows below).
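A rough sketch of the kind of file-switching setup discussed in the second bullet: route a few hand-picked extensions to mmap and serve everything else (including compound files) through NIOFSDirectory. The extension split and the path are purely illustrative, not what Elasticsearch's HybridDirectory actually does:

```java
import java.nio.file.Path;
import java.util.Set;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FileSwitchDirectory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.store.NIOFSDirectory;

public class SwitchingDirectoryExample {
    public static void main(String[] args) throws Exception {
        Path indexPath = Path.of("/tmp/example-index"); // hypothetical path

        // Only these extensions go through mmap (the primary directory);
        // everything else, including compound files, is read via NIOFSDirectory.
        Set<String> mmapExtensions = Set.of("tip", "tim", "doc", "pos", "dvd", "nvd", "kdd", "kdi");

        Directory dir = new FileSwitchDirectory(
                mmapExtensions,
                new MMapDirectory(indexPath),
                new NIOFSDirectory(indexPath),
                true /* close both wrapped directories when this one is closed */);

        System.out.println(dir);
        dir.close();
    }
}
```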

> Having to change the vm.max_map_count up-front is annoying, but hitting an out-of-memory error at runtime is even worse?

I do not agree with this. It takes quite a long time until this happens and most users won't be affected by it. I would just print the warning on startup, or maybe complain at runtime if the system figures out that you have many indexes with many files open. If somebody hits this, they get an IOException like "disk full" or similar. We no longer have the bad "OutOfMemoryError" mentioned above; that was fixed a long time ago, so the exception should not confuse anybody. The exception also gives a good message about what needs to be done and what caused it. Lucene's commits work well, so the node may just stop working (but this can also happen when the disk is full). Just stop the node, change the setting in your sysctl and start it again. It should replay the transactions and all is fine again. Stuff like this has happened to me very often, so I am glad that Elasticsearch handles those error conditions well.

In short: my problem is only that you can't start the node at all when you do not have the setting applied. My only wish is to instead print a warning, and maybe show that warning in Kibana too, but refusing to start is a no-go. The same applied in the past to settings like enforcing "swappiness=0", although everybody around me suggests using "swappiness=1" or disabling swap entirely. If users want to configure the cluster their own way, let them do it. Forcing users to do something just so they don't shoot themselves in the foot is fine to a certain degree, but there must always be a way around checks. I patched that startup error out of an older Elasticsearch cluster used for library search with only a few indexes, because I don't agree with applying those settings to my kernel, sorry.

@jpountz
Contributor

jpountz commented Jan 3, 2023

> you can't start the node at all when you do not have the setting applied

For reference, it is still possible to set node.store.allow_mmap: false to avoid having to set vm.max_map_count. This isn't fully satisfactory as it prevents using mmap at all and falls back to NIOFSDirectory, but at least it allows running Elasticsearch.

jpountz added a commit to jpountz/elasticsearch that referenced this issue Jan 4, 2023
Lucene's `TieredMergePolicy`'s default consists of creating compound files for
segments that do not exceed 10% of the total index size at the time of writing
these segments. This sort of default makes sense in the context of a single
Lucene index, but much less for a system like Elasticsearch where a node may
host heterogeneous shards: it doesn't make sense to decide to use a compound
file for a 1GB segment on one shard because it has 30GB of data, and not on
another shard because it has 5GB of data.

This change does the following:
 - `index.compound_format` now accepts byte size values, in addition to a
   `boolean` or ratio in [0,1].
 - The default value of `index.compound_format` changes from `0.1` to `1gb`.
   Said otherwise, segments will be stored in compound files if they don't
   exceed 1GB.

In practice, this means that smaller shards will use compound files much more
and contribute less to the total number of open files or map regions, while the
bigger shards may use compound files a bit less.

Relates elastic#92618
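To spell out the example from the commit message, a minimal sketch of the old ratio-based default versus the new fixed 1gb default (simplified: the real setting also accepts explicit boolean and ratio values):

```java
public class CompoundFormatDefaults {
    // Old default (0.1): compound file if the segment is at most 10% of the shard.
    static boolean oldDefault(long segmentBytes, long shardBytes) {
        return segmentBytes <= shardBytes / 10;
    }

    // New default (1gb): compound file if the segment does not exceed 1 GiB.
    static boolean newDefault(long segmentBytes) {
        return segmentBytes <= (1L << 30);
    }

    public static void main(String[] args) {
        long oneGiB = 1L << 30;
        // The 1 GiB segment from the commit message: the old rule decides
        // differently depending on shard size, the new rule does not.
        System.out.println(oldDefault(oneGiB, 30L << 30)); // true  (30 GiB shard)
        System.out.println(oldDefault(oneGiB, 5L << 30));  // false (5 GiB shard)
        System.out.println(newDefault(oneGiB));            // true
    }
}
```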
jpountz added a commit that referenced this issue Jan 4, 2023
@yanchanglj

uschindler's proposal is a good idea. Give users the right to choose instead of forcing them.

@justmike1

justmike1 commented Sep 20, 2023

I ended up going with what @jpountz described and it works (8.10.1). What are the downsides of that? From my own research I don't see anything major other than a different way of working, but why is one the default instead of the other?

Why isn't NIOFSDirectory the default?
