Set proxy net.Dial KeepAlive to 1 second #258

tlwr · 2019-10-27T16:07:12Z

A short explanation of the proposed change:

Configure proxy's net.Dialer to probe idle connections every 1s instead of every 15s.

An explanation of the use cases your change solves

When gorouter is configured to keep connections alive (DisableKeepAlives: false) then it will keep idle TCP connections open to backends for up to 90 seconds. If the connection is not used after 90 seconds then it is closed; if a new request comes in, and there is a idle connection to the backend ready, then it will be used instead of a new TCP connection being made.

If a backend endpoint closes a connection which has been kept alive at
the same time as gorouter tries to use the connection, then Gorouter
returns a 502 and logs backend-endpoint-failed due to an io.EOF on the socket.

The underlying TCP connection is managed by net.Dialer which by default probes the status of the underlying connection every 15s, which communicates to the recipient that the connection should still be kept open.

Some apps running in our deployment have keep-alives configured to be less than the 15s default probe interval.

Enabling connection keep alives in our foundry caused gorouter to return 502 (backend-endpoint-failed) for approximately ~0.01% of requests to backends which had short keepalive durations.

It did not appear to have a substantial effect on apps with longer keepalive durations, although I have not conclusively tested this.

Now that routing-release (since 0.192) defaults keep-alives to backend connections to be on, it may be sensible to have a shorter probe interval, to avoid these hard to diagnose networking issues.

This PR configures the default probe interval of the underlying net.Dialer in the same style as EndpointDialTimeout; it adds a new
configuration parameter EndpointKeepAliveProbeInterval which configures the underlying net.Dialer KeepAlive parameter, and overrides the default of 15s to be 1s.

I chose 1s somewhat arbitrarily, but it has the property of being lower than the 2s default keepalive duration that some of the pathological apps deployed to our CF have.

This should not adversely affect the performance of gorouter, however that it will send 1 probe per connection idle connection every one second, instead of 1 probe per connection every 15 seconds.

Instructions to functionally test the behavior change

I found this quite hard to test as it only appears at scale, when traffic fluctuates, for requests which cannot be retried.

I would appreciate advice in how to write a test case for this (perhaps a soak test in one of CFF's long lived environments might do the trick?).

Expected result after the change

When I deploy this change, when keepalive's to backends are enabled, I will not see 0.01% of requests 502 with backend-endpoint-failed due to keepalives.

The net.Dialer probe will signal to the backend to keep the connection open.

Current result before the change

We have deployed routing-release with max_idle_connections: 2500 twice, and reverted it twice. When max_idle_connections is non-zero then gorouter will keep up to 100 idle connections (hardcoded) open to backends.

The backends may close the connection while gorouter is trying to use them.

Since routing-release 0.192 the default value for max_idle_connections is non-zero, so I expect to see other deployments experience this.

Links to any other associated PRs

This issue #209 has similar symptoms but the cause is different

I have viewed signed and have submitted the Contributor License Agreement
I have made this pull request to the master branch
I have run all the unit tests using bin/test
I have run CF Acceptance Tests on bosh lite

When gorouter is configured to keep connections alive (DisableKeepAlives=false) then it will keep idle TCP connections open to backends for up to 90 seconds. If the connection is not used after 90 seconds then it is closed; if a new request comes in, and there is a idle connection to the backend ready, then it will be used instead of a new TCP connection being made. The symptoms of this are very similar to this issue: cloudfoundry#209 ; but the cause is different. If a backend endpoint closes a connection which has been kept alive at the same time as gorouter tries to use the connection, then Gorouter returns a 502 and logs 'backend-endpoint-failed' due to an io.EOF on the socket. The underlying TCP connection is managed by net.Dialer which by default probes the status of the underlying connection every 15s, which communicates to the recipient that the connection should still be kept open. Some apps running in our foundry have keepalives configured to be less than the 15s default probe interval. Enabling connection keep alives in our foundry caused gorouter to return 502 (backend-endpoint-failed) for approximately ~0.01% of requests to backends which had short keepalive durations. It did not appear to have a substantial effect on apps with longer keepalive durations, although I have not conclusively tested this. This only affects requests which cannot be retried (e.g. POST) Now that routing-release (since 0.192) defaults keep-alives to backend connections to be on, it may be sensible to have a shorter probe interval, to avoid these hard to diagnose networking issues. This commit configures the default probe interval of the underlying net.Dialer in the same style as EndpointDialTimeout; it adds a new configuration parameter EndpointKeepAliveProbeInterval which configures the underlying net.Dialer KeepAlive parameter, and overrides the default of 15s to be 1s. I chose 1s somewhat arbitrarily, but it has the property of being lower than the 2s default keepalive duration that some of the pathological apps deployed to our CF have. This should not adversely affect the performance of gorouter, however that it will send 1 probe per connection idle connection every one second, instead of 1 probe per connection every 15 seconds. Signed-off-by: toby lorne <toby@toby.codes>

cf-gitbot · 2019-10-27T16:07:14Z

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/169384608

The labels on this github issue will be updated when the story is started.

tlwr · 2019-10-27T17:45:46Z

CLA bot didn't appear but I have signed both the new (#257) and the old CLA (#251)

KauzClay · 2019-10-28T17:06:05Z

Hi @tlwr ,

Thanks for this PR. The reasoning and code look good to us. However, we would appreciate if you could add a unit test for the config change (similar to this).

Happy to merge afterwards!

Thanks
@KauzClay and @angelachin

tlwr · 2019-10-28T17:22:42Z

Hi, thanks!

I've added a test that checks that the default is set appropriately.

The test that you have linked to tests that the parsed configuration is set appropriately.

The first commits changes the configuration setting, but does not make it configurable from the config file, in the same way as EndpointDialTimeout which can be configured on the struct only.

I am happy to, as well as the test I have just written, enable it as a configurable option (endpoint_keep_alive_probe_interval), and write a test that it can also be configured from YAML

:)

Edit: I see no reason to not make it configurable although a separate PR will have to be raised in routing-release. I'll make the change...

Edit of edit: Done, once this is merged shall I raise a corresponding PR in routing-release to allow it to be configured, or omit it for now (like max_idle_conns_per_host)?

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>

KauzClay · 2019-10-28T17:36:37Z

@tlwr Wow we were just drafting a response to say yes, configurable sounds like a good move, but you did it anyway! Thanks!

KauzClay · 2019-10-28T18:00:26Z

@tlwr we merged. We will commit that into routing release so it points to a version of gorouter with your change, so it will be there when you PR routing-release

Edit: routing-release develop now points to your gorouter changes! cloudfoundry/routing-release@a5eb78a

tlwr · 2019-10-28T18:02:33Z

We will commit that into routing release so it points to a version of gorouter with your change, so it will be there when you PR routing-release

Awesome, I'll see you in a PR in routing-release 👋 🏃

In cloudfoundry/gorouter#258 we customised and made configurable the endpoint_keep_alive_probe_interval property in gorouter. This commit allows the user to configure the endpoint_keep_alive_probe_interval using the router.keep_alive_probe_interval property, and adds a default so documentation is clear. In order to confirm that my logic was correct I added some tests Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>

cf-gitbot added the unscheduled label Oct 27, 2019

Toby Lorne added 2 commits October 28, 2019 17:34

Add keepalive probe interval default config test

3386609

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>

Allow keep alive probe interval to be configured

b44b55a

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>

tlwr force-pushed the set-keepalive-timeout branch from a5cdfb8 to b44b55a Compare October 28, 2019 17:35

KauzClay merged commit 8339efb into cloudfoundry:master Oct 28, 2019

cf-gitbot removed the unscheduled label Oct 28, 2019

tlwr deleted the set-keepalive-timeout branch October 28, 2019 18:01

tlwr mentioned this pull request Oct 28, 2019

Configure gorouter keep alive probe interval cloudfoundry/routing-release#159

Merged

5 tasks

cf-gitbot added delivered accepted and removed delivered labels Oct 31, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Set proxy net.Dial KeepAlive to 1 second #258

Set proxy net.Dial KeepAlive to 1 second #258

tlwr commented Oct 27, 2019 •

edited

Loading

cf-gitbot commented Oct 27, 2019

tlwr commented Oct 27, 2019

KauzClay commented Oct 28, 2019

tlwr commented Oct 28, 2019 •

edited

Loading

KauzClay commented Oct 28, 2019

KauzClay commented Oct 28, 2019 •

edited

Loading

tlwr commented Oct 28, 2019

Set proxy net.Dial KeepAlive to 1 second #258

Set proxy net.Dial KeepAlive to 1 second #258

Conversation

tlwr commented Oct 27, 2019 • edited Loading

A short explanation of the proposed change:

An explanation of the use cases your change solves

Instructions to functionally test the behavior change

Expected result after the change

Current result before the change

Links to any other associated PRs

cf-gitbot commented Oct 27, 2019

tlwr commented Oct 27, 2019

KauzClay commented Oct 28, 2019

tlwr commented Oct 28, 2019 • edited Loading

KauzClay commented Oct 28, 2019

KauzClay commented Oct 28, 2019 • edited Loading

tlwr commented Oct 28, 2019

tlwr commented Oct 27, 2019 •

edited

Loading

tlwr commented Oct 28, 2019 •

edited

Loading

KauzClay commented Oct 28, 2019 •

edited

Loading