Document that Transparent Huge Pages should be disabled on Linux #26551

Open
jakommo opened this issue Sep 8, 2017 · 17 comments
Labels
>docs General docs changes :Performance All issues related to Elasticsearch performance including regressions and investigations Team:Docs Meta label for docs team Team:Performance Meta label for performance team

Comments

@jakommo
Contributor

jakommo commented Sep 8, 2017

There seems to be a kernel issue https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1644056 that causes a kernel crash under high load.
Also reported in discuss: https://discuss.elastic.co/t/elasticsearch-5-4-2-process-periodically-dying-with-high-cpu-load-and-kernel-message-pgtable-generic-c-33-bad-pmd/92239

It seems like this can be worked around by disabling THP on Linux, e.g. echo -n never > /sys/kernel/mm/transparent_hugepage/enabled.
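
For readers who want to check the current state before changing it, here is a minimal sketch of the workaround above. The function names `thp_active_mode` and `thp_set_mode` are hypothetical helpers, not part of any tool. The sketch assumes the kernel marks the active value with square brackets (for example `always madvise [never]`); writing to the real sysfs file needs root and does not persist across reboots.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting and changing the THP mode.

# Print the active THP mode. The kernel marks it with square brackets,
# e.g. "always madvise [never]".
thp_active_mode() {
    # $1: file to read (defaults to the THP "enabled" knob)
    file="${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
    sed -n 's/.*\[\([a-z+_]*\)\].*/\1/p' "$file"
}

# Write a mode to the THP knob. On a real system this requires root
# and the setting is lost on reboot.
thp_set_mode() {
    # $1: mode (always | madvise | never), $2: optional file to write
    mode="$1"
    file="${2:-/sys/kernel/mm/transparent_hugepage/enabled}"
    case "$mode" in
        always|madvise|never) printf '%s' "$mode" > "$file" ;;
        *) echo "unsupported mode: $mode" >&2; return 1 ;;
    esac
}
```

After sourcing the file, `thp_active_mode` prints the mode currently in effect and `thp_set_mode never` (run as root) applies the workaround. Both accept an alternative path, which makes them easy to try against a scratch file first.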

I had a chat with @jasontedor and we should recommend disabling THP in general, not just on the affected kernel versions.
It seems to be enabled by default on at least Ubuntu 14.04/16.04 and RHEL 6/7.

I think Important System Configuration would be a good place for this.

@jakommo jakommo added the >docs General docs changes label Sep 8, 2017
@javanna javanna added the help wanted adoptme label Sep 14, 2017
@jakommo
Contributor Author

jakommo commented Sep 27, 2017

I talked to a user who still experienced the above kernel bug after disabling THP, though a lot less frequently.
What seems to have solved it for them is to also disable NUMA, so maybe we can add this as well (if there are no objections from the dev side).

@jasontedor
Member

As I mentioned in another channel, we should make the THP recommendation independent of any kernel bugs that may or may not be present.

As far as NUMA goes, our recommendations here require gathering more data and running some experiments. I don’t think we should base our recommendations on a single data point (one that might be fixed in different kernel versions).

@colings86 colings86 added the :Core/Infra/Core Core issues without another label label Apr 24, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@dliappis
Contributor

dliappis commented Apr 24, 2018

One additional data point here: @danielmitterdorfer and I are working on evaluating a) the stability and b) the performance behavior of Elasticsearch with and without THP. To be more exact, tuning involves testing not only THP enabled/disabled but also the THP defrag option, which as of kernel 4.6.1 offers new defrag strategies.

So far in our nightly benchmarking environment we have discovered that disabling THP (which in newer kernels is usually done by setting /sys/kernel/mm/transparent_hugepage/{defrag,enabled} to madvise) causes a performance drop in Elasticsearch.
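
To see which combination a given host is actually running, a minimal sketch that reports both settings mentioned above could look like this. `thp_report` is a hypothetical helper name, and the sketch assumes the kernel marks the active value in each file with square brackets.

```shell
#!/bin/sh
# Hypothetical helper: print the active value of both THP knobs.
thp_report() {
    # $1: base directory (defaults to the THP sysfs directory)
    base="${1:-/sys/kernel/mm/transparent_hugepage}"
    for knob in enabled defrag; do
        # The active value is the bracketed token, e.g. "always [madvise] never".
        value=$(sed -n 's/.*\[\([a-z+_]*\)\].*/\1/p' "$base/$knob")
        printf '%s=%s\n' "$knob" "${value:-unknown}"
    done
}
```

Calling `thp_report` with no argument reads the live settings. Passing a directory lets you run it against copies of the two files collected from other machines.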

Additionally, more recent versions of the Ubuntu kernel (starting with 4.12.2) now set THP to madvise rather than enabled, which used to be the default; this is how we became aware of the performance regression in the first place. madvise will also be the default setting in the upcoming Ubuntu 18.04 LTS release in a few days.

We will provide more details when the necessary long-running benchmarks have finished and we have enough CI runs and benchmarking data to back a THP suggestion.

@alexander-marquardt

alexander-marquardt commented Jul 2, 2018

MongoDB recommends against THP in the following document, so the same logic might apply to Elasticsearch if our access patterns are similar: https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/

@pmoust
Member

pmoust commented Jul 17, 2019

@dliappis did you get the chance to test more / dig deeper on this?

@dliappis
Contributor

@pmoust unfortunately the deep dive work has been paused due to other activities. However, since my last comment some things haven't been documented so I'll summarize:

@jasontedor
Member

I wouldn't expect stability issues with always, just long-tail latency spikes from transparent huge page defragging.

@jrodewig
Contributor

jrodewig commented Nov 1, 2019

[docs issue triage]

@rjernst rjernst added Team:Core/Infra Meta label for core/infra team Team:Docs Meta label for docs team labels May 4, 2020
@rjernst rjernst added the needs:triage Requires assignment of a team area label label Dec 3, 2020
@jaymode jaymode added :Performance All issues related to Elasticsearch performance including regressions and investigations Team:Performance Meta label for performance team and removed :Core/Infra/Core Core issues without another label Team:Core/Infra Meta label for core/infra team labels Dec 14, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-perf (Team:Performance)

@jimczi jimczi removed the needs:triage Requires assignment of a team area label label Jan 12, 2021
@zez3

zez3 commented May 5, 2021

@dliappis
Is there a detailed description somewhere of the cluster nodes used in elasticsearch-ci.elastic.co testing?
I am most interested in JVM heap sizes and per-node memory usage during the tests.
Did the nodes use swap?
How much swap was used?
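
For anyone collecting this kind of data on their own nodes, a minimal sketch for the swap question is shown below. `swap_used_kb` is a hypothetical helper name. It assumes the standard `SwapTotal` and `SwapFree` fields of `/proc/meminfo`, both reported in kB.

```shell
#!/bin/sh
# Hypothetical helper: print swap currently in use, in kB.
swap_used_kb() {
    # $1: meminfo-style file to read (defaults to /proc/meminfo)
    file="${1:-/proc/meminfo}"
    awk '/^SwapTotal:/ {total=$2} /^SwapFree:/ {free=$2} END {print total - free}' "$file"
}
```

Sampling `swap_used_kb` periodically during a benchmark run gives a rough per-node picture of swap usage over time.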

@lockewritesdocs
Contributor

This issue hasn’t been updated in 3+ years so I’m closing it. We can revisit if needed.

@pmoust
Member

pmoust commented May 24, 2022

I'd like us to have a look again into this @tomcallahan @dliappis, if you agree.
It's unclear to me what the exact recommendation is nowadays.

@pmoust pmoust reopened this May 24, 2022
@okwute419

I'd like us to have a look again into this @tomcallahan @dliappis, if you agree. It's unclear to me what the exact recommendation is nowadays.

This issue hasn't been updated in almost 1yr, can we close this?

@fholzer

fholzer commented May 1, 2023

Given the previous statement about a ca. 7% performance degradation under certain circumstances, I would assume it would be beneficial to get an official recommendation on this topic and add it to the official documentation.

@dliappis dliappis removed their assignment Oct 26, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-docs (Team:Docs)

@danielmitterdorfer danielmitterdorfer removed their assignment Nov 20, 2023
@geekpete
Member

Will the recent enhancements in Lucene 9.11.0 around madvise have any bearing on this issue once a stack release containing Lucene 9.11.0 is available? (8.14.0 uses Lucene 9.10.0.)

https://lucene.apache.org/core/9_11_0/changes/Changes.html#v9.11.0.new_features
