-
Notifications
You must be signed in to change notification settings - Fork 227
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Set proxy net.Dial KeepAlive to 1 second #258
Conversation
When gorouter is configured to keep connections alive (DisableKeepAlives=false) then it will keep idle TCP connections open to backends for up to 90 seconds. If the connection is not used after 90 seconds then it is closed; if a new request comes in, and there is a idle connection to the backend ready, then it will be used instead of a new TCP connection being made. The symptoms of this are very similar to this issue: cloudfoundry#209 ; but the cause is different. If a backend endpoint closes a connection which has been kept alive at the same time as gorouter tries to use the connection, then Gorouter returns a 502 and logs 'backend-endpoint-failed' due to an io.EOF on the socket. The underlying TCP connection is managed by net.Dialer which by default probes the status of the underlying connection every 15s, which communicates to the recipient that the connection should still be kept open. Some apps running in our foundry have keepalives configured to be less than the 15s default probe interval. Enabling connection keep alives in our foundry caused gorouter to return 502 (backend-endpoint-failed) for approximately ~0.01% of requests to backends which had short keepalive durations. It did not appear to have a substantial effect on apps with longer keepalive durations, although I have not conclusively tested this. This only affects requests which cannot be retried (e.g. POST) Now that routing-release (since 0.192) defaults keep-alives to backend connections to be on, it may be sensible to have a shorter probe interval, to avoid these hard to diagnose networking issues. This commit configures the default probe interval of the underlying net.Dialer in the same style as EndpointDialTimeout; it adds a new configuration parameter EndpointKeepAliveProbeInterval which configures the underlying net.Dialer KeepAlive parameter, and overrides the default of 15s to be 1s. I chose 1s somewhat arbitrarily, but it has the property of being lower than the 2s default keepalive duration that some of the pathological apps deployed to our CF have. This should not adversely affect the performance of gorouter, however that it will send 1 probe per connection idle connection every one second, instead of 1 probe per connection every 15 seconds. Signed-off-by: toby lorne <toby@toby.codes>
We have created an issue in Pivotal Tracker to manage this: https://www.pivotaltracker.com/story/show/169384608 The labels on this github issue will be updated when the story is started. |
Hi @tlwr , Thanks for this PR. The reasoning and code look good to us. However, we would appreciate if you could add a unit test for the config change (similar to this). Happy to merge afterwards! Thanks |
Hi, thanks! I've added a test that checks that the default is set appropriately. The test that you have linked to tests that the parsed configuration is set appropriately. The first commits changes the configuration setting, but does not make it configurable from the config file, in the same way as I am happy to, as well as the test I have just written, enable it as a configurable option ( :) Edit: I see no reason to not make it configurable although a separate PR will have to be raised in routing-release. I'll make the change... Edit of edit: Done, once this is merged shall I raise a corresponding PR in routing-release to allow it to be configured, or omit it for now (like |
Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
a5cdfb8
to
b44b55a
Compare
@tlwr Wow we were just drafting a response to say yes, configurable sounds like a good move, but you did it anyway! Thanks! |
@tlwr we merged. We will commit that into routing release so it points to a version of gorouter with your change, so it will be there when you PR routing-release Edit: routing-release develop now points to your gorouter changes! cloudfoundry/routing-release@a5eb78a |
Awesome, I'll see you in a PR in routing-release 👋 🏃 |
In cloudfoundry/gorouter#258 we customised and made configurable the endpoint_keep_alive_probe_interval property in gorouter. This commit allows the user to configure the endpoint_keep_alive_probe_interval using the router.keep_alive_probe_interval property, and adds a default so documentation is clear. In order to confirm that my logic was correct I added some tests Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
A short explanation of the proposed change:
Configure proxy's
net.Dialer
to probe idle connections every 1s instead of every 15s.An explanation of the use cases your change solves
When gorouter is configured to keep connections alive (
DisableKeepAlives: false
) then it will keep idle TCP connections open to backends for up to 90 seconds. If the connection is not used after 90 seconds then it is closed; if a new request comes in, and there is a idle connection to the backend ready, then it will be used instead of a new TCP connection being made.If a backend endpoint closes a connection which has been kept alive at
the same time as gorouter tries to use the connection, then Gorouter
returns a 502 and logs
backend-endpoint-failed
due to an io.EOF on the socket.The underlying TCP connection is managed by
net.Dialer
which by default probes the status of the underlying connection every 15s, which communicates to the recipient that the connection should still be kept open.Some apps running in our deployment have keep-alives configured to be less than the 15s default probe interval.
Enabling connection keep alives in our foundry caused gorouter to return 502 (backend-endpoint-failed) for approximately ~0.01% of requests to backends which had short keepalive durations.
It did not appear to have a substantial effect on apps with longer keepalive durations, although I have not conclusively tested this.
Now that routing-release (since 0.192) defaults keep-alives to backend connections to be on, it may be sensible to have a shorter probe interval, to avoid these hard to diagnose networking issues.
This PR configures the default probe interval of the underlying net.Dialer in the same style as EndpointDialTimeout; it adds a new
configuration parameter EndpointKeepAliveProbeInterval which configures the underlying net.Dialer KeepAlive parameter, and overrides the default of 15s to be 1s.
I chose 1s somewhat arbitrarily, but it has the property of being lower than the 2s default keepalive duration that some of the pathological apps deployed to our CF have.
This should not adversely affect the performance of gorouter, however that it will send 1 probe per connection idle connection every one second, instead of 1 probe per connection every 15 seconds.
Instructions to functionally test the behavior change
I found this quite hard to test as it only appears at scale, when traffic fluctuates, for requests which cannot be retried.
I would appreciate advice in how to write a test case for this (perhaps a soak test in one of CFF's long lived environments might do the trick?).
Expected result after the change
When I deploy this change, when keepalive's to backends are enabled, I will not see 0.01% of requests 502 with backend-endpoint-failed due to keepalives.
The
net.Dialer
probe will signal to the backend to keep the connection open.Current result before the change
We have deployed routing-release with
max_idle_connections: 2500
twice, and reverted it twice. Whenmax_idle_connections
is non-zero then gorouter will keep up to 100 idle connections (hardcoded) open to backends.The backends may close the connection while gorouter is trying to use them.
Since routing-release 0.192 the default value for max_idle_connections is non-zero, so I expect to see other deployments experience this.
Links to any other associated PRs
This issue #209 has similar symptoms but the cause is different
I have viewed signed and have submitted the Contributor License Agreement
I have made this pull request to the
master
branchI have run all the unit tests using
bin/test
I have run CF Acceptance Tests on bosh lite