Added code to track tainting of a tunnel. #155

Closed
wants to merge 2 commits

Conversation

cheftako
Contributor

Taint lasts for the TCP keepalive period.
Taint code only works if the client remembers to properly close its
connection.
Tested using ifconfig down to break the connection and prevent the TCP
close from the OS. (kill -9 and similar do not work.)
This does NOT close tunnels but instead relies on the already
implemented TCP keepalive for that functionality.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 24, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheftako

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 24, 2020
@cheftako
Contributor Author

/hold a) to add a CLI flag to enable/disable the behavior and b) to let #144 merge first. (I should take care of merge conflicts)

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 24, 2020
@cheftako
Contributor Author

/assign @caesarxuchao

Adding locks which were documented but missing.
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 27, 2020
Member

@caesarxuchao caesarxuchao left a comment

Walter, can you remind me what problem this tainting mechanism is trying to solve? IIUC, it tries to avoid picking a backend if it has recently failed. If so, the benefit is saving end users of the network proxy from waiting for the server to detect that a GRPC connection to the backend is broken, which could take up to the keepAlive seconds.

The tainting mechanism is quite complex. I don't know if the benefit is worth the complexity.

}

// BackendStorage is an interface to manage the storage of the backend
// connections, i.e., get, add and remove
type BackendStorage interface {
// AddBackend adds a backend.
AddBackend(agentID string, conn agent.AgentService_ConnectServer) Backend
// TaintBackend indicates an error occurred on a backend and allows to BackendManager to act based on the error.
Member

"allows the* BackendManager..."?

@@ -76,7 +87,7 @@ type BackendManager interface {
// context instead of a request-scoped context, as the backend manager will
// pick a backend for every tunnel session and each tunnel session may
// contains multiple requests.
Backend(ctx context.Context) (Backend, error)
Backend(ctx context.Context) (Backend, string, error)
Member

Can you update the comment to say what the returned string is?

// in the Backend() method. There is no reliable way to randomly
// pick a key from a map (in this case, the backends) in Golang.
// Consider switching this to a map[string]bool to get set behavior
// A little nasty as length where true is yuck.
Member

Sorry, I can't follow the last two lines of the comment, can you explain more?

Contributor Author

Golang has no set type, but having one would make things a little easier (it would prevent issues with re-adding existing values and throwing off the count). The standard workaround is to use a map[string]bool to get that functionality. We would also want a set-length function, which would be equivalent to counting the number of keys where the bool is true. Very doable, but I decided it was a nice-to-have for now.
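
In other words, something like the following sketch of the map[string]bool idiom (the names are illustrative, not from this PR):

```go
package main

import "fmt"

// backendSet emulates a set of backend IDs using the standard
// map[string]bool workaround described above.
type backendSet map[string]bool

// add marks an ID as present; re-adding an existing ID is a no-op,
// so the count is never thrown off.
func (s backendSet) add(id string) { s[id] = true }

// remove marks an ID as absent.
func (s backendSet) remove(id string) { delete(s, id) }

// size counts the keys whose value is true, i.e. the "set length"
// mentioned above.
func (s backendSet) size() int {
	n := 0
	for _, present := range s {
		if present {
			n++
		}
	}
	return n
}

func main() {
	s := backendSet{}
	s.add("agent-a")
	s.add("agent-a") // duplicate add does not change the size
	s.add("agent-b")
	fmt.Println(s.size()) // 2
}
```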

if frontStream.getStatus() != Closing {
klog.ErrorS(err, "Stream read from frontend failure",
"agentID", agentID)
s.BackendManager.TaintBackend(agentID, err)
Member

Why do we taint the backend when the frontend stream read is broken?

Contributor Author

Sadly, if we lose the backend to something like a machine crash or network outage, we get no error on the backend. Essentially we send TCP packets which disappear into the ether. The TCP timeout or even keepalive is generally in minutes, while the frontend request timeout is in seconds. So the error shows up as a frontend error (really a timeout, but to us it looks like a context closed error). We can't be sure what is wrong, so it's better to "taint" the connection than to close it. We need to rely on the keepalive options others have added to take care of closing it. However, keepalive can be slow. Tainting helps us avoid sending requests to a dead connection while we are determining whether the connection is actually dead.
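
Roughly, the idea reads like the sketch below: a taint map keyed by agent ID whose entries expire after about the keepalive period. All names and the expiry handling are illustrative, not the PR's actual types:

```go
package taint

import (
	"sync"
	"time"
)

// taintTracker records recent backend failures so the proxy can avoid
// picking a backend that has just failed, without closing its tunnel.
// Entries expire after roughly the TCP keepalive period, since by then
// keepalive will have closed a truly dead connection anyway.
type taintTracker struct {
	mu     sync.Mutex
	taints map[string]time.Time // agentID -> time of last observed error
	ttl    time.Duration        // e.g. the TCP keepalive period
}

func newTaintTracker(ttl time.Duration) *taintTracker {
	return &taintTracker{taints: make(map[string]time.Time), ttl: ttl}
}

// taint marks a backend as having recently failed.
func (t *taintTracker) taint(agentID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.taints[agentID] = time.Now()
}

// isTainted reports whether a backend failed within the last ttl.
func (t *taintTracker) isTainted(agentID string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	when, ok := t.taints[agentID]
	if !ok {
		return false
	}
	if time.Since(when) > t.ttl {
		delete(t.taints, agentID) // taint has expired
		return false
	}
	return true
}
```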

Member

@Jefftree Jefftree Oct 29, 2020

Does this mean hitting any client side timeout results in a backend taint? I've encountered timeouts on the k/k client side and sometimes it's just the destination service being overloaded and taking a while to respond. This happens a decent amount in the k/k e2e tests and the retry mechanism for most clients takes care of the problem.

I'm worried a problematic destination (eg: webhook) could cause all backends to be tainted for a short period of time.

Contributor Author

All backends getting tainted just reverts us to the current behavior. We lose the benefits of tainting but the system continues to work.

Member

We can detect a broken agent by observing whether a dial response is returned by the agent on time. That's less prone to false positives.
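
A rough sketch of that alternative, assuming a hypothetical channel carrying the agent's dial response (every name here is illustrative, not from this PR):

```go
package proxy

import (
	"errors"
	"time"
)

var errNoDialResponse = errors.New("timed out waiting for dial response from agent")

// dialOrTaint waits for the agent's dial response and taints the backend
// only when no response arrives before the deadline, instead of tainting
// on any frontend read error.
func dialOrTaint(respCh <-chan struct{}, timeout time.Duration, taint func(error)) {
	select {
	case <-respCh:
		// The agent answered in time, so the tunnel is presumed healthy.
	case <-time.After(timeout):
		taint(errNoDialResponse) // likely a broken agent or dead tunnel
	}
}
```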

@@ -250,6 +250,7 @@ func (c *Client) run(o *GrpcProxyClientOptions) error {
time.Sleep(wait)
}
}
client.CloseIdleConnections()
Member

What happens if client idle connections are not closed?

Contributor Author

@cheftako cheftako Oct 29, 2020

In my tests it causes us to incorrectly taint tunnels. We then systematically work our way through the list of tunnels/backends until they are all tainted. At that point the code works the way it does today. So it's not worse than today, but it does mean you're not getting any benefit out of the tunnel/backend tainting code.

Member

The default IdleConnTimeout for golang's http library is 90 seconds, and almost all clients in k/k keep that parameter untouched.

I'm surprised that we get incorrectly tainted tunnels rather than just a really delayed CLOSE_REQ.
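
For reference, a minimal sketch of the two knobs under discussion: the net/http IdleConnTimeout (which defaults to 90 seconds on http.DefaultTransport) and an explicit CloseIdleConnections call on shutdown. The timeout value shown is illustrative:

```go
package main

import (
	"net/http"
	"time"
)

func main() {
	// By default, net/http keeps idle connections for 90 seconds
	// (http.DefaultTransport's IdleConnTimeout). Shortening it makes
	// an abandoned idle connection go away sooner.
	client := &http.Client{
		Transport: &http.Transport{
			IdleConnTimeout: 30 * time.Second, // illustrative value
		},
	}

	resp, err := client.Get("http://example.com/")
	if err == nil {
		resp.Body.Close()
	}

	// Explicitly closing idle connections on shutdown sends the TCP
	// close right away instead of waiting for the idle timeout, which
	// is what the added client.CloseIdleConnections() call relies on.
	client.CloseIdleConnections()
}
```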

Contributor Author

90 seconds is a really long time. With our test client (and no CloseIdleConnections call) I was seeing a context closed error on the frontend connection after <10 seconds. That may just be the OS cleaning up the connection once the process went away.

Member

That may just be the OS cleaning up the connection once the process went away.

Ahh I think that's exactly it. I remember adding a sleep interval at the end of the client previously and it gracefully closed the connection after 90s was hit.

This line addition makes sense, but we shouldn't make this assumption for all our clients. It's unlikely for the kube-apiserver process to be terminated, so we probably wouldn't run into these context closed errors in k/k too often. However, with the 90s idle time, the earliest we'd be able to detect a problem on a connection is also 90s, which is already on the order of minutes.

Contributor Author

I'd be interested in seeing what happens with webhook calls, which have <10 second http timeouts.

Contributor Author

kubernetes/kubernetes#95981 would reduce that in general to 45s. (Still not great.) I am also curious to see the behavior with webhook calls, as they generally have a sub-10s http timeout.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Feb 4, 2021
@k8s-ci-robot
Contributor

@cheftako: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Jefftree
Member

Jefftree commented Feb 9, 2021

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 9, 2021
@k8s-triage-robot

The lifecycle/frozen label can not be applied to PRs.

This bot removes lifecycle/frozen from PRs because:

  • Commenting /lifecycle frozen on a PR has not worked since March 2021
  • PRs that remain open for >150 days are unlikely to be easily rebased

You can:

  • Rebase this PR and attempt to get it merged
  • Close this PR with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/remove-lifecycle frozen

@k8s-ci-robot k8s-ci-robot removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Aug 18, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 16, 2021
@k8s-ci-robot
Contributor

@cheftako: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                                       | Commit  | Details | Required | Rerun command
pull-apiserver-network-proxy-make-lint          | 67b05e9 | link    |          | /test pull-apiserver-network-proxy-make-lint
pull-apiserver-network-proxy-docker-build-amd64 | 67b05e9 | link    | true     | /test pull-apiserver-network-proxy-docker-build-amd64
pull-apiserver-network-proxy-docker-build-arm64 | 67b05e9 | link    | true     | /test pull-apiserver-network-proxy-docker-build-arm64

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 30, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
