Default GC_Thresh params too low for large environments #54

Closed
julz opened this issue Oct 29, 2018 · 3 comments

julz commented Oct 29, 2018

Hi cf-networkers! Here is an issue.

Issue

On larger environments (~600+ cells seems to be the tipping point) we experience severe performance degradation once cf-networking is enabled. This can be resolved by raising the kernel's neighbour-table (ARP cache) garbage-collection thresholds, listed below. I'm opening an issue here rather than PRing a fix because I'm not sure whether this is best done in the stemcell or in cf-networking, and I'd like to get your opinion (I also have no idea how to test this, sorry :-( ).

Context

The default (stemcell) values for net.ipv4.neigh.default.gc_thresh1, net.ipv4.neigh.default.gc_thresh2 and net.ipv4.neigh.default.gc_thresh3 cause very high CPU load on environments with a large number of cells once container networking is enabled, leading to app crashes and system instability.
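
To judge how close a cell is to these limits, the number of IPv4 neighbour entries can be compared with gc_thresh3, which the kernel treats as the hard cap on the table. A minimal sketch, assuming `ip` (iproute2) and `sysctl` are installed on the cell; the function names `usage_pct` and `report_neigh_usage` are hypothetical helpers, not part of cf-networking:

```shell
#!/bin/sh
# usage_pct ENTRIES LIMIT: print how full the table is, as an integer percentage.
usage_pct() {
  echo $(( $1 * 100 / $2 ))
}

# report_neigh_usage: compare the live IPv4 neighbour table with gc_thresh3.
# Assumption: iproute2's `ip` and procps' `sysctl` are available.
report_neigh_usage() {
  entries=$(ip -4 neigh show | wc -l)
  limit=$(sysctl -n net.ipv4.neigh.default.gc_thresh3)
  echo "entries=${entries} gc_thresh3=${limit} used=$(usage_pct "$entries" "$limit")%"
}
```

Running `report_neigh_usage` on a cell prints one summary line; a value that stays near 100% suggests the table is being garbage-collected continuously.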

Steps to Reproduce

Scale to a large (~600ish+) number of nodes (sorry - not sure of an easier way to reproduce this unless you can expand the ARP cache size some other way)

Expected result

The system should not experience instability even with large numbers of cells.

Current result

Once enough cells join the overlay, CPU load becomes very high and apps start crashing (due to health check failures). Kernel logs show neighbour: arp_cache: neighbor table overflow.
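
To confirm that a cell is hitting this condition, the kernel log can be scanned for the overflow message. A rough sketch, assuming the log text is piped in on stdin; `count_overflows` is a hypothetical helper name, and the pattern accepts both the "neighbor" and "neighbour" spellings in case the wording differs between kernel versions:

```shell
#!/bin/sh
# count_overflows: read log lines on stdin and print how many of them
# report a neighbour table overflow.
count_overflows() {
  # grep -c exits non-zero when nothing matches; `|| true` keeps the
  # helper usable under `set -e` while still printing 0.
  grep -ciE 'neighbou?r table overflow' || true
}
```

For example, `dmesg | count_overflows` prints the number of matching lines currently in the kernel ring buffer.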

Possible Fix

The following sysctl parameters fix the problem:

sudo sysctl -w net.ipv4.neigh.default.gc_thresh3=8192
sudo sysctl -w net.ipv4.neigh.default.gc_thresh2=4096
sudo sysctl -w net.ipv4.neigh.default.gc_thresh1=2048
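
Values set with `sysctl -w` do not survive a reboot, so they would also need to be written to a sysctl configuration file. A minimal sketch that renders such a file from a single gc_thresh3 value; the 4:2:1 ratio simply mirrors the values above and is an assumption rather than a recommendation, and `render_gc_thresh_conf` and the file name in the usage note are hypothetical:

```shell
#!/bin/sh
# render_gc_thresh_conf THRESH3: print sysctl.d-style settings.
# Assumption: gc_thresh2 is half and gc_thresh1 a quarter of gc_thresh3,
# matching the 8192/4096/2048 values proposed in this issue.
render_gc_thresh_conf() {
  thresh3=$1
  printf 'net.ipv4.neigh.default.gc_thresh1 = %d\n' $(( thresh3 / 4 ))
  printf 'net.ipv4.neigh.default.gc_thresh2 = %d\n' $(( thresh3 / 2 ))
  printf 'net.ipv4.neigh.default.gc_thresh3 = %d\n' "$thresh3"
}
```

One possible use is `render_gc_thresh_conf 8192 | sudo tee /etc/sysctl.d/90-neigh-gc-thresh.conf`, followed by `sudo sysctl --system` to reload all sysctl configuration files.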

Picture of what William Shatner would look like debugging this

[image]

@cf-gitbot

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/161541346

The labels on this github issue will be updated when the story is started.

@ameowlia
Member

Hi friends,

We added a doc with helpful hints for using cf-networking & silk with large deployments.

https://github.com/cloudfoundry/cf-networking-release/blob/develop/docs/large_deployments.md

Let us know what you think. I expect it will be updated more as you start using our features :)

Is there anything else we can do for this issue?

Thanks,
Amelia and @nhsieh, CF Networking Team Members

cf-gitbot added the scheduled label and removed the unscheduled label on Nov 14, 2018
@nhsieh
Contributor

nhsieh commented Nov 20, 2018

Hi all,

We are going to close this due to inactivity. Let us know if we can help you more in any way.

Best,
Nancy and @ameowlia, CF Networking Team Members

nhsieh closed this as completed on Nov 20, 2018
cf-gitbot added the delivered label and removed the scheduled label on Nov 20, 2018