Default GC_Thresh params too low for large environments #54
Comments
We have created an issue in Pivotal Tracker to manage this: https://www.pivotaltracker.com/story/show/161541346 The labels on this GitHub issue will be updated when the story is started.
Hi friends, We added a doc with helpful hints for using cf-networking & silk with large deployments: https://github.com/cloudfoundry/cf-networking-release/blob/develop/docs/large_deployments.md Let us know what you think. I expect it will be updated more as you start using our features :) Is there anything else we can do for this issue? Thanks,
Hi all, We are going to close this due to inactivity. Let us know if we can help you more in any way. Best,
Hi cf-networkers! Here is an issue.
Issue
On larger environments (~600+ cells seems to be the tipping point) we experience severe performance degradation once cf-networking is enabled. This can be resolved by increasing various kernel parameters, listed below. I'm opening an issue here rather than PRing a fix because I'm not super sure whether this is best done in the stemcell or in cf-networking and I'd like to get your opinion (I also have no idea how to test this, sorry :-( ).
Context
The default (stemcell) values for `net.ipv4.neigh.default.gc_thresh1`, `net.ipv4.neigh.default.gc_thresh2`, and `net.ipv4.neigh.default.gc_thresh3` cause very high CPU load on environments with larger numbers of cells once container networking is enabled, leading to app crashes and system instability.
Steps to Reproduce
Scale to a large (~600ish+) number of nodes (sorry - not sure of an easier way to reproduce this unless you can expand the ARP cache size some other way)
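If you suspect you are approaching the ARP cache limits, a quick way to check (a minimal sketch, assuming a Linux host with iproute2 installed) is to read the current GC thresholds from `/proc` and count the live neighbour entries:

```shell
# Inspect the kernel's neighbour-table GC thresholds (Linux)
cat /proc/sys/net/ipv4/neigh/default/gc_thresh1 \
    /proc/sys/net/ipv4/neigh/default/gc_thresh2 \
    /proc/sys/net/ipv4/neigh/default/gc_thresh3

# Count current IPv4 neighbour (ARP) entries to see how close you are to gc_thresh3
ip -4 neigh show | wc -l
```

If the entry count sits near or above `gc_thresh3`, the kernel starts dropping entries and you will see the symptoms described below.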
Expected result
The system should not experience instability even with large numbers of cells.
Current result
Once enough cells join the overlay, CPU load becomes very high and apps start crashing (due to health check failures). Kernel logs show `neighbour: arp_cache: neighbor table overflow`.
Possible Fix
The following `sysctl` parameters fix the problem:
(image: Picture of what William Shatner would look like debugging this)
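The reporter's exact values did not survive extraction, but a commonly used starting point for large deployments looks like the following (the numbers here are illustrative assumptions, not the originals from this issue; tune them to your environment):

```
# /etc/sysctl.d/99-arp-cache.conf -- illustrative values, not the reporter's originals
# gc_thresh1: below this many entries, the neighbour GC does not run at all
net.ipv4.neigh.default.gc_thresh1 = 4096
# gc_thresh2: soft maximum; above this, entries are garbage-collected aggressively
net.ipv4.neigh.default.gc_thresh2 = 8192
# gc_thresh3: hard maximum; above this, new neighbour entries are dropped
net.ipv4.neigh.default.gc_thresh3 = 16384
```

Apply with `sysctl --system` (or `sysctl -p <file>`). The key constraint is that `gc_thresh3` must comfortably exceed the total number of overlay peers each cell must track.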