Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set proxy net.Dial KeepAlive to 1 second #258

Merged
merged 3 commits into from
Oct 28, 2019

Conversation

tlwr
Copy link
Contributor

@tlwr tlwr commented Oct 27, 2019

A short explanation of the proposed change:

Configure proxy's net.Dialer to probe idle connections every 1s instead of every 15s.

An explanation of the use cases your change solves

When gorouter is configured to keep connections alive (DisableKeepAlives: false) then it will keep idle TCP connections open to backends for up to 90 seconds. If the connection is not used after 90 seconds then it is closed; if a new request comes in, and there is a idle connection to the backend ready, then it will be used instead of a new TCP connection being made.

If a backend endpoint closes a connection which has been kept alive at
the same time as gorouter tries to use the connection, then Gorouter
returns a 502 and logs backend-endpoint-failed due to an io.EOF on the socket.

The underlying TCP connection is managed by net.Dialer which by default probes the status of the underlying connection every 15s, which communicates to the recipient that the connection should still be kept open.

Some apps running in our deployment have keep-alives configured to be less than the 15s default probe interval.

Enabling connection keep alives in our foundry caused gorouter to return 502 (backend-endpoint-failed) for approximately ~0.01% of requests to backends which had short keepalive durations.

It did not appear to have a substantial effect on apps with longer keepalive durations, although I have not conclusively tested this.

Now that routing-release (since 0.192) defaults keep-alives to backend connections to be on, it may be sensible to have a shorter probe interval, to avoid these hard to diagnose networking issues.

This PR configures the default probe interval of the underlying net.Dialer in the same style as EndpointDialTimeout; it adds a new
configuration parameter EndpointKeepAliveProbeInterval which configures the underlying net.Dialer KeepAlive parameter, and overrides the default of 15s to be 1s.

I chose 1s somewhat arbitrarily, but it has the property of being lower than the 2s default keepalive duration that some of the pathological apps deployed to our CF have.

This should not adversely affect the performance of gorouter, however that it will send 1 probe per connection idle connection every one second, instead of 1 probe per connection every 15 seconds.

Instructions to functionally test the behavior change

I found this quite hard to test as it only appears at scale, when traffic fluctuates, for requests which cannot be retried.

I would appreciate advice in how to write a test case for this (perhaps a soak test in one of CFF's long lived environments might do the trick?).

Expected result after the change

When I deploy this change, when keepalive's to backends are enabled, I will not see 0.01% of requests 502 with backend-endpoint-failed due to keepalives.

The net.Dialer probe will signal to the backend to keep the connection open.

Current result before the change

We have deployed routing-release with max_idle_connections: 2500 twice, and reverted it twice. When max_idle_connections is non-zero then gorouter will keep up to 100 idle connections (hardcoded) open to backends.

The backends may close the connection while gorouter is trying to use them.

Since routing-release 0.192 the default value for max_idle_connections is non-zero, so I expect to see other deployments experience this.

Links to any other associated PRs

This issue #209 has similar symptoms but the cause is different

  • I have viewed signed and have submitted the Contributor License Agreement

  • I have made this pull request to the master branch

  • I have run all the unit tests using bin/test

  • I have run CF Acceptance Tests on bosh lite

When gorouter is configured to keep connections alive
(DisableKeepAlives=false) then it will keep idle TCP connections open to
backends for up to 90 seconds. If the connection is not used after 90
seconds then it is closed; if a new request comes in, and there is a
idle connection to the backend ready, then it will be used instead of a
new TCP connection being made.

The symptoms of this are very similar to this issue:
cloudfoundry#209 ; but the cause is
different.

If a backend endpoint closes a connection which has been kept alive at
the same time as gorouter tries to use the connection, then Gorouter
returns a 502 and logs 'backend-endpoint-failed' due to an io.EOF on the
socket.

The underlying TCP connection is managed by net.Dialer which by default
probes the status of the underlying connection every 15s, which
communicates to the recipient that the connection should still be kept
open.

Some apps running in our foundry have keepalives configured to be less
than the 15s default probe interval. Enabling connection keep alives in
our foundry caused gorouter to return 502 (backend-endpoint-failed) for
approximately ~0.01% of requests to backends which had short keepalive
durations. It did not appear to have a substantial effect on apps with
longer keepalive durations, although I have not conclusively tested
this.

This only affects requests which cannot be retried (e.g. POST)

Now that routing-release (since 0.192) defaults keep-alives to backend
connections to be on, it may be sensible to have a shorter probe
interval, to avoid these hard to diagnose networking issues.

This commit configures the default probe interval of the underlying
net.Dialer in the same style as EndpointDialTimeout; it adds a new
configuration parameter EndpointKeepAliveProbeInterval which configures
the underlying net.Dialer KeepAlive parameter, and overrides the default
of 15s to be 1s.

I chose 1s somewhat arbitrarily, but it has the property of being lower
than the 2s default keepalive duration that some of the pathological
apps deployed to our CF have.

This should not adversely affect the performance of gorouter, however that it
will send 1 probe per connection idle connection every one second,
instead of 1 probe per connection every 15 seconds.

Signed-off-by: toby lorne <toby@toby.codes>
@cf-gitbot
Copy link

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/169384608

The labels on this github issue will be updated when the story is started.

@tlwr
Copy link
Contributor Author

tlwr commented Oct 27, 2019

CLA bot didn't appear but I have signed both the new (#257) and the old CLA (#251)

@KauzClay
Copy link
Contributor

Hi @tlwr ,

Thanks for this PR. The reasoning and code look good to us. However, we would appreciate if you could add a unit test for the config change (similar to this).

Happy to merge afterwards!

Thanks
@KauzClay and @angelachin

@tlwr
Copy link
Contributor Author

tlwr commented Oct 28, 2019

Hi, thanks!

I've added a test that checks that the default is set appropriately.

The test that you have linked to tests that the parsed configuration is set appropriately.

The first commits changes the configuration setting, but does not make it configurable from the config file, in the same way as EndpointDialTimeout which can be configured on the struct only.

I am happy to, as well as the test I have just written, enable it as a configurable option (endpoint_keep_alive_probe_interval), and write a test that it can also be configured from YAML

:)

Edit: I see no reason to not make it configurable although a separate PR will have to be raised in routing-release. I'll make the change...

Edit of edit: Done, once this is merged shall I raise a corresponding PR in routing-release to allow it to be configured, or omit it for now (like max_idle_conns_per_host)?

Toby Lorne added 2 commits October 28, 2019 17:34
Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
@tlwr tlwr force-pushed the set-keepalive-timeout branch from a5cdfb8 to b44b55a Compare October 28, 2019 17:35
@KauzClay
Copy link
Contributor

@tlwr Wow we were just drafting a response to say yes, configurable sounds like a good move, but you did it anyway! Thanks!

@KauzClay KauzClay merged commit 8339efb into cloudfoundry:master Oct 28, 2019
@KauzClay
Copy link
Contributor

KauzClay commented Oct 28, 2019

@tlwr we merged. We will commit that into routing release so it points to a version of gorouter with your change, so it will be there when you PR routing-release

Edit: routing-release develop now points to your gorouter changes! cloudfoundry/routing-release@a5eb78a

@tlwr tlwr deleted the set-keepalive-timeout branch October 28, 2019 18:01
@tlwr
Copy link
Contributor Author

tlwr commented Oct 28, 2019

We will commit that into routing release so it points to a version of gorouter with your change, so it will be there when you PR routing-release

Awesome, I'll see you in a PR in routing-release 👋 🏃

tlwr pushed a commit to alphagov/paas-routing-release that referenced this pull request Oct 28, 2019
In cloudfoundry/gorouter#258 we customised and
made configurable the endpoint_keep_alive_probe_interval property in
gorouter.

This commit allows the user to configure the
endpoint_keep_alive_probe_interval using the
router.keep_alive_probe_interval property, and adds a default so
documentation is clear.

In order to confirm that my logic was correct I added some tests

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants