Default GC_Thresh params too low for large environments #54

julz · 2018-10-29T08:42:32Z

Hi cf-networkers! Here is an issue.

Issue

On larger environments (~600+ cells seems to be the tipping point) we experience severe performance degradation once cf-networking is enabled. This can be resolved by increasing the neighbour-table gc_thresh kernel parameters listed below. I'm opening an issue here rather than PRing a fix because I'm not super sure whether this is best done in the stemcell or in cf-networking, and I'd like to get your opinion (I also have no idea how to test this, sorry :-( ).

Context

The default (stemcell) values for net.ipv4.neigh.default.gc_thresh1, net.ipv4.neigh.default.gc_thresh2 and net.ipv4.neigh.default.gc_thresh3 cause very high CPU load on environments with large numbers of cells once container networking is enabled, leading to app crashes and system instability.
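To see what a given cell is currently running with, the three values can be read straight from procfs. This is a minimal sketch, assuming a Linux host; the function name show_gc_thresh and its optional directory argument are illustrative and not part of cf-networking or the stemcell:

```shell
# Print the three neighbour-table GC thresholds, one "name = value" per line.
# $1: directory holding the gc_thresh files (optional, hypothetical parameter
#     added so the function can be pointed at a copy of the files).
show_gc_thresh() {
    dir="${1:-/proc/sys/net/ipv4/neigh/default}"
    for n in 1 2 3; do
        printf 'gc_thresh%s = %s\n' "$n" "$(cat "$dir/gc_thresh$n")"
    done
}
```

Calling show_gc_thresh with no argument reads the live values. As far as the kernel documentation describes them, gc_thresh1 is the entry count below which the garbage collector leaves the table alone, gc_thresh2 is the soft limit above which collection becomes aggressive, and gc_thresh3 is the hard limit on entries.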

Steps to Reproduce

Scale to a large (~600ish+) number of nodes (sorry - not sure of an easier way to reproduce this unless you can expand the ARP cache size some other way)

Expected result

The system should not experience instability even with large numbers of cells.

Current result

Once enough cells join the overlay, CPU load becomes very high and apps start crashing (due to health check failures). Kernel logs show "neighbour: arp_cache: neighbor table overflow".
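One rough way to check how close a cell is to that overflow is to count ARP entries and compare them with gc_thresh3. The sketch below assumes /proc/net/arp is a good enough approximation of the IPv4 neighbour table (it may not list every entry the kernel counts); all function names are hypothetical:

```shell
# Count entries in a file in /proc/net/arp format
# (one header line, then one line per entry).
# $1: path to the table, defaults to /proc/net/arp
arp_entry_count() {
    awk 'NR > 1 { n++ } END { print n + 0 }' "${1:-/proc/net/arp}"
}

# Print usage as an integer percentage of the hard limit.
# $1: entry count, $2: gc_thresh3 value
neigh_usage_pct() {
    echo $(( $1 * 100 / $2 ))
}

# Print a one-line report for the current host.
neigh_report() {
    count=$(arp_entry_count)
    limit=$(cat /proc/sys/net/ipv4/neigh/default/gc_thresh3)
    echo "ipv4 neighbours: $count of $limit ($(neigh_usage_pct "$count" "$limit")%)"
}
```

Running neigh_report on a cell while the deployment scales up should show the count climbing towards the limit before the overflow message starts appearing in the kernel log.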

Possible Fix

The following sysctl parameters fix the problem:

sudo sysctl -w net.ipv4.neigh.default.gc_thresh3=8192
sudo sysctl -w net.ipv4.neigh.default.gc_thresh2=4096
sudo sysctl -w net.ipv4.neigh.default.gc_thresh1=2048
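Values set with sysctl -w do not survive a reboot. One possible way to persist them on a single host is a sysctl.d fragment; the sketch below only renders the fragment text. The function name and the file name in the usage comment are hypothetical, and whether this belongs in the stemcell or in cf-networking is still the open question of this issue:

```shell
# Render a sysctl.d-style fragment for the three thresholds.
# $1: gc_thresh1, $2: gc_thresh2, $3: gc_thresh3
# Refuses values that are not in ascending order.
render_gc_thresh_conf() {
    if [ "$1" -gt "$2" ] || [ "$2" -gt "$3" ]; then
        echo "expected gc_thresh1 <= gc_thresh2 <= gc_thresh3" >&2
        return 1
    fi
    printf 'net.ipv4.neigh.default.gc_thresh1 = %s\n' "$1"
    printf 'net.ipv4.neigh.default.gc_thresh2 = %s\n' "$2"
    printf 'net.ipv4.neigh.default.gc_thresh3 = %s\n' "$3"
}

# Example usage (as root; the file name is an arbitrary choice):
#   render_gc_thresh_conf 2048 4096 8192 > /etc/sysctl.d/60-neigh-gc-thresh.conf
#   sysctl --system
```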



cf-gitbot · 2018-10-29T08:42:33Z

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/161541346

The labels on this github issue will be updated when the story is started.

ameowlia · 2018-10-30T17:27:02Z

Hi friends,

We added a doc with helpful hints for using cf-networking & silk with large deployments.

https://github.com/cloudfoundry/cf-networking-release/blob/develop/docs/large_deployments.md

Let us know what you think. I expect it will be updated more as you start using our features :)

Is there anything else we can do for this issue?

Thanks,
Amelia and @nhsieh, CF Networking Team Members

nhsieh · 2018-11-20T00:02:22Z

Hi all,

We are going to close this due to inactivity. Let us know if we can help you more in any way.

Best,
Nancy and @ameowlia, CF Networking Team Members

cf-gitbot added the unscheduled label Oct 29, 2018

cf-gitbot added the scheduled label ("We agree this change makes sense and plan to work on it ourselves at some point.") and removed the unscheduled label Nov 14, 2018

nhsieh closed this as completed Nov 20, 2018

cf-gitbot added the delivered label and removed the scheduled label Nov 20, 2018
