Document that Transparent Huge Pages should be disabled on Linux #26551

jakommo · 2017-09-08T16:52:27Z

There seems to be a kernel issue, https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1644056, that causes a kernel crash under high load.
Also reported in discuss: https://discuss.elastic.co/t/elasticsearch-5-4-2-process-periodically-dying-with-high-cpu-load-and-kernel-message-pgtable-generic-c-33-bad-pmd/92239

It seems like this can be worked around by disabling THP on Linux, e.g. echo -n never > /sys/kernel/mm/transparent_hugepage/enabled.
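
Before applying the workaround, it can help to check which THP mode is currently active. The sketch below assumes the usual sysfs format, where the kernel lists all modes and puts the active one in brackets (for example "always madvise [never]"). The helper name thp_active_mode is hypothetical and only used for illustration.

```shell
#!/bin/sh
# Print the active THP mode from a sysfs-style file.
# The active mode is the bracketed word, e.g. "always madvise [never]" -> never.
# thp_active_mode is a hypothetical helper name, not an existing tool.
thp_active_mode() {
    # $1: path of the file to read
    sed -n 's/.*\[\([a-z+]*\)\].*/\1/p' "$1"
}

THP_ENABLED=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$THP_ENABLED" ]; then
    echo "THP mode: $(thp_active_mode "$THP_ENABLED")"
else
    # The path is missing on kernels built without THP support.
    echo "THP sysfs entry not found"
fi
```

Note that a value written with echo does not survive a reboot, so a persistent change needs to be reapplied at boot by whatever mechanism the distribution provides.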

I had a chat with @jasontedor and we should recommend disabling THP in general, not just on the affected kernel versions.
Seems like this is enabled by default on at least Ubuntu 14/16.04 and RHEL 6/7.

I think Important System Configuration would be a good place for this.


jakommo · 2017-09-27T10:53:43Z

I talked to a user and they still experienced the above kernel bug after disabling THP, though a lot less frequently.
What seems to have solved it for them now is to also disable NUMA, so maybe we can add this as well (if there are no objections from the dev side).

jasontedor · 2017-10-06T19:27:43Z

As I mentioned in another channel, we should make the THP recommendation independent of any kernel bugs that may or may not be present.

As for NUMA, any recommendation here requires gathering more data and running some experiments. I don’t think we should base our recommendations on a single data point (one that might be fixed in different kernel versions).

elasticmachine · 2018-04-24T09:19:53Z

Pinging @elastic/es-core-infra

dliappis · 2018-04-24T10:14:14Z

One additional data point here: @danielmitterdorfer and I are evaluating a) the stability and b) the performance behavior of Elasticsearch with and without THP. To be more exact, tuning involves testing not only THP enabled/disabled but also the THP defrag option, which as of kernel 4.6.1 offers new defrag strategies.

So far in our nightly benchmarking environment we have discovered that disabling THP (which in newer kernels is usually done by setting /sys/kernel/mm/transparent_hugepage/{defrag,enabled} to madvise) causes a performance drop in Elasticsearch.
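
As a minimal sketch of the setting described above, the following writes one mode to both the enabled and defrag files. The function name set_thp_mode is hypothetical; it takes the base directory as an argument so it can be tried against a scratch directory first. Writing to the real sysfs path requires root.

```shell
#!/bin/sh
# Write the same THP mode to both "enabled" and "defrag" under a base directory.
# set_thp_mode is a hypothetical helper name used for illustration only.
set_thp_mode() {
    base=$1
    mode=$2
    # Only accept the modes that are valid for both files.
    case "$mode" in
        always|madvise|never) ;;
        *) echo "unsupported mode: $mode" >&2; return 1 ;;
    esac
    for f in enabled defrag; do
        printf '%s' "$mode" > "$base/$f" || return 1
    done
}

# Example invocation (as root), left commented out so the sketch has no side effects:
# set_thp_mode /sys/kernel/mm/transparent_hugepage madvise
```

Newer kernels accept additional defrag values, which this sketch deliberately leaves out to keep the two files in step.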

Additionally, more recent versions of the Ubuntu kernel (starting with 4.12.2) now set THP to madvise rather than leaving it enabled, which used to be the default. That change is how we became aware of the performance regression in the first place. madvise will also be the default setting in the upcoming Ubuntu 18.04 LTS release in a few days.

We will provide more details once the necessary long-running benchmarks have finished and we have enough CI runs and benchmarking data to back a THP suggestion.

alexander-marquardt · 2018-07-02T13:11:37Z

MongoDB recommends against THP in the following document, so the same logic might apply to Elasticsearch if our access patterns are similar: https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/

pmoust · 2019-07-17T23:29:32Z

@dliappis did you get the chance to test more / dig deeper on this?

dliappis · 2019-07-19T10:30:28Z

@pmoust unfortunately the deep dive work has been paused due to other activities. However, since my last comment some things haven't been documented so I'll summarize:

The indexing performance drop in Elasticsearch when THP is set to madvise was observed only with the geonames track. The drop was approximately 7%. The performance didn't change in the remaining tracks.
To evaluate the stability impact of THP set to always, all workers in elasticsearch-ci.elastic.co have /sys/kernel/mm/transparent_hugepage/{defrag,enabled} set to always, and we haven't observed any obvious impact on stability.
All our nightly benchmark environments (see history of changes) have transparent_hugepage enabled and defrag settings set to always.

jasontedor · 2019-07-19T10:37:18Z

I wouldn't expect stability issues with always, just long-tail latency spikes from transparent huge page defragging.
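
One way to look for this kind of activity on a node is to sample the THP and compaction counters in /proc/vmstat before and after a workload. This is only a sketch: the helper name vmstat_counter is hypothetical, and the counters shown (thp_fault_alloc, thp_collapse_alloc, compact_stall) are absent on kernels built without THP or compaction support, in which case the value prints empty.

```shell
#!/bin/sh
# Print the value of one counter from a vmstat-style file ("name value" per line).
# vmstat_counter is a hypothetical helper name used for illustration only.
vmstat_counter() {
    # $1: file to read, $2: counter name
    awk -v key="$2" '$1 == key { print $2 }' "$1"
}

if [ -r /proc/vmstat ]; then
    for c in thp_fault_alloc thp_collapse_alloc compact_stall; do
        echo "$c=$(vmstat_counter /proc/vmstat "$c")"
    done
fi
```

A rising compact_stall count during a latency spike would be consistent with defragmentation stalls, but it does not prove them on its own.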

jrodewig · 2019-11-01T19:14:27Z

[docs issue triage]

elasticmachine · 2020-12-14T21:13:49Z

Pinging @elastic/es-perf (Team:Performance)

zez3 · 2021-05-05T15:06:23Z

@dliappis
Is there a detailed description somewhere of the cluster nodes used in elasticsearch-ci.elastic.co testing?
I am most interested in JVM heap sizes and per-node memory usage during the tests.
Did the nodes use swap?
How much swap was used?

lockewritesdocs · 2022-04-27T15:08:01Z

This issue hasn’t been updated in 3+ years so I’m closing it. We can revisit if needed.

pmoust · 2022-05-24T08:59:12Z

I'd like us to have a look again into this @tomcallahan @dliappis, if you agree.
It's unclear to me what the exact recommendation is nowadays.

okwute419 · 2023-04-30T11:36:00Z

> I'd like us to have a look again into this @tomcallahan @dliappis, if you agree. It's unclear to me what the exact recommendation is nowadays.

This issue hasn't been updated in almost a year. Can we close it?

fholzer · 2023-05-01T14:38:21Z

Given the previous statement about a ca. 7% performance degradation under certain circumstances, I would assume it would be beneficial to get an official recommendation on this topic and add it to the official documentation.

elasticsearchmachine · 2023-10-26T06:44:44Z

Pinging @elastic/es-docs (Team:Docs)

geekpete · 2024-06-10T00:02:10Z

Will the recent enhancements around madvise in Lucene 9.11.0 have any bearing on this issue once a stack release containing Lucene 9.11.0 is out? (8.14.0 uses Lucene 9.10.0.)

https://lucene.apache.org/core/9_11_0/changes/Changes.html#v9.11.0.new_features

jakommo added the >docs General docs changes label Sep 8, 2017

javanna added the help wanted adoptme label Sep 14, 2017

colings86 added the :Core/Infra/Core Core issues without another label label Apr 24, 2018

dliappis assigned dliappis and danielmitterdorfer Apr 24, 2018

danielmitterdorfer removed the help wanted adoptme label Apr 25, 2018

alpar-t mentioned this issue Jun 12, 2019

[CI] Task :x-pack:plugin:ccr:qa:multi-cluster-with-non-compliant-license:followClusterTestCluster#wait task fails #42583

Closed

rjernst added Team:Core/Infra Meta label for core/infra team Team:Docs Meta label for docs team labels May 4, 2020

rjernst added the needs:triage Requires assignment of a team area label label Dec 3, 2020

jaymode added :Performance All issues related to Elasticsearch performance including regressions and investigations Team:Performance Meta label for performance team and removed :Core/Infra/Core Core issues without another label Team:Core/Infra Meta label for core/infra team labels Dec 14, 2020

jimczi removed the needs:triage Requires assignment of a team area label label Jan 12, 2021

lockewritesdocs closed this as completed Apr 27, 2022

pmoust reopened this May 24, 2022

dliappis removed their assignment Oct 26, 2023

danielmitterdorfer removed their assignment Nov 20, 2023
