Added code to track tainting of a tunnel. #155

Closed
wants to merge 2 commits

Conversation

cheftako
Contributor

Taint lasts for the TCP keepalive period.
Taint code only works if the client remembers to properly close its
connection.
Tested using ifconfig down to break the connection and prevent the TCP
close from the OS. (kill -9 and similar do not work.)
This does NOT close tunnels but instead relies on the already
implemented TCP keepalive for that functionality.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 24, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheftako

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 24, 2020
@cheftako
Contributor Author

/hold a) to add a CLI flag to enable/disable the behavior and b) to let #144 merge first. (I should take care of merge conflicts)

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 24, 2020
@cheftako
Contributor Author

/assign @caesarxuchao

Adding locks which were documented but missing.
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 27, 2020
Member

@caesarxuchao caesarxuchao left a comment

Walter, can you remind me what problem this tainting mechanism is trying to solve? IIUC, it tries to avoid picking a backend if it has recently failed. If so, the benefit is saving end users of the network proxy from waiting for the server to detect that a GRPC connection to the backend is broken, which could take up to the keepAlive seconds.

The tainting mechanism is quite complex. I don't know if the benefit is worth the complexity.

}

// BackendStorage is an interface to manage the storage of the backend
// connections, i.e., get, add and remove
type BackendStorage interface {
// AddBackend adds a backend.
AddBackend(agentID string, conn agent.AgentService_ConnectServer) Backend
// TaintBackend indicates an error occurred on a backend and allows to BackendManager to act based on the error.
Member

"allows the* BackendManager..."?

@@ -76,7 +87,7 @@ type BackendManager interface {
// context instead of a request-scoped context, as the backend manager will
// pick a backend for every tunnel session and each tunnel session may
// contains multiple requests.
Backend(ctx context.Context) (Backend, error)
Backend(ctx context.Context) (Backend, string, error)
Member

Can you update the comment to say what the returned string is?

// in the Backend() method. There is no reliable way to randomly
// pick a key from a map (in this case, the backends) in Golang.
// Consider switching this to a map[string]bool to get set behavior
// A little nasty as length where true is yuck.
Member

Sorry, I can't follow the last two lines of the comment, can you explain more?

Contributor Author

Golang has no set type, but having one would make things a little easier (it would prevent issues with re-adding existing values and throwing off the count). The standard workaround is to use a map[string]bool to get that functionality. We would also want a set-length function, which would be equivalent to counting the number of keys where the bool is true. Very doable, but I decided it was a nice-to-have for now.
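
In other words, something like the following sketch of the map[string]bool idiom (the names are illustrative, not from this PR):

```go
package main

import "fmt"

// backendSet emulates a set of backend IDs using the standard
// map[string]bool workaround described above.
type backendSet map[string]bool

// add marks an ID as present; re-adding an existing ID is a no-op,
// so the count is never thrown off.
func (s backendSet) add(id string) { s[id] = true }

// remove marks an ID as absent.
func (s backendSet) remove(id string) { delete(s, id) }

// size counts the keys whose value is true, i.e. the "set length"
// mentioned above.
func (s backendSet) size() int {
	n := 0
	for _, present := range s {
		if present {
			n++
		}
	}
	return n
}

func main() {
	s := backendSet{}
	s.add("agent-a")
	s.add("agent-a") // duplicate add does not change the size
	s.add("agent-b")
	fmt.Println(s.size()) // 2
}
```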

if frontStream.getStatus() != Closing {
klog.ErrorS(err, "Stream read from frontend failure",
"agentID", agentID)
s.BackendManager.TaintBackend(agentID, err)
Member

Why do we taint the backend when the frontend stream read is broken?

Contributor Author

Sadly, if we lose the backend to something like a machine crash or network outage, we get no error on the backend. Essentially we send TCP packets which disappear into the ether. The TCP timeout or even keepalive is generally in minutes, while the frontend request timeout is in seconds. So the error shows up as a frontend error (really a timeout, but to us it looks like a context closed error). We can't be sure what is wrong, so it's better to "taint" the connection than to close it. We need to rely on the keepalive options others have added to take care of closing it. However, keepalive can be slow. Tainting helps us avoid sending requests to a dead connection while we are determining whether the connection is actually dead.
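
Roughly, the idea reads like the sketch below: a taint map keyed by agent ID whose entries expire after about the keepalive period. All names and the expiry handling are illustrative, not the PR's actual types:

```go
package taint

import (
	"sync"
	"time"
)

// taintTracker records recent backend failures so the proxy can avoid
// picking a backend that has just failed, without closing its tunnel.
// Entries expire after roughly the TCP keepalive period, since by then
// keepalive will have closed a truly dead connection anyway.
type taintTracker struct {
	mu     sync.Mutex
	taints map[string]time.Time // agentID -> time of last observed error
	ttl    time.Duration        // e.g. the TCP keepalive period
}

func newTaintTracker(ttl time.Duration) *taintTracker {
	return &taintTracker{taints: make(map[string]time.Time), ttl: ttl}
}

// taint marks a backend as having recently failed.
func (t *taintTracker) taint(agentID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.taints[agentID] = time.Now()
}

// isTainted reports whether a backend failed within the last ttl.
func (t *taintTracker) isTainted(agentID string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	when, ok := t.taints[agentID]
	if !ok {
		return false
	}
	if time.Since(when) > t.ttl {
		delete(t.taints, agentID) // taint has expired
		return false
	}
	return true
}
```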

Member

@Jefftree Jefftree Oct 29, 2020

Does this mean hitting any client side timeout results in a backend taint? I've encountered timeouts on the k/k client side and sometimes it's just the destination service being overloaded and taking a while to respond. This happens a decent amount in the k/k e2e tests and the retry mechanism for most clients takes care of the problem.

I'm worried a problematic destination (eg: webhook) could cause all backends to be tainted for a short period of time.

Contributor Author

All backends getting tainted just reverts us to the current behavior. We lose the benefits of tainting but the system continues to work.

Member

We can detect a broken agent by observing whether a dial response is returned by the agent on time. That's less prone to false positives.
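
A rough sketch of that alternative, assuming a hypothetical channel carrying the agent's dial response (every name here is illustrative, not from this PR):

```go
package proxy

import (
	"errors"
	"time"
)

var errNoDialResponse = errors.New("timed out waiting for dial response from agent")

// dialOrTaint waits for the agent's dial response and taints the backend
// only when no response arrives before the deadline, instead of tainting
// on any frontend read error.
func dialOrTaint(respCh <-chan struct{}, timeout time.Duration, taint func(error)) {
	select {
	case <-respCh:
		// The agent answered in time, so the tunnel is presumed healthy.
	case <-time.After(timeout):
		taint(errNoDialResponse) // likely a broken agent or dead tunnel
	}
}
```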

@@ -250,6 +250,7 @@ func (c *Client) run(o *GrpcProxyClientOptions) error {
time.Sleep(wait)
}
}
client.CloseIdleConnections()
Member

What happens if client idle connections are not closed?

Contributor Author

@cheftako cheftako Oct 29, 2020

In my tests it causes us to incorrectly taint tunnels. We then systematically work our way through the list of tunnels/backends until they are all tainted. At that point the code works the way it does today. So it's not worse than today, but it does mean you're not getting any benefit out of the tunnel/backend tainting code.

Member

The default IdleConnTimeout for golang's http library is 90 seconds, and almost all clients in k/k keep that parameter untouched.

I'm surprised that we get incorrectly tainted tunnels rather than just a really delayed CLOSE_REQ.
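
For reference, a minimal sketch of the two knobs under discussion: the net/http IdleConnTimeout (which defaults to 90 seconds on http.DefaultTransport) and an explicit CloseIdleConnections call on shutdown. The timeout value shown is illustrative:

```go
package main

import (
	"net/http"
	"time"
)

func main() {
	// By default, net/http keeps idle connections for 90 seconds
	// (http.DefaultTransport's IdleConnTimeout). Shortening it makes
	// an abandoned idle connection go away sooner.
	client := &http.Client{
		Transport: &http.Transport{
			IdleConnTimeout: 30 * time.Second, // illustrative value
		},
	}

	resp, err := client.Get("http://example.com/")
	if err == nil {
		resp.Body.Close()
	}

	// Explicitly closing idle connections on shutdown sends the TCP
	// close right away instead of waiting for the idle timeout, which
	// is what the added client.CloseIdleConnections() call relies on.
	client.CloseIdleConnections()
}
```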

Contributor Author

90 seconds is a really long time. With our test client (and no CloseIdleConnections call) I was seeing a context closed error on the frontend connection after <10 seconds. That may just be the OS cleaning up the connection once the process went away.

Member

That may just be the OS cleaning up the connection once the process went away.

Ahh I think that's exactly it. I remember adding a sleep interval at the end of the client previously and it gracefully closed the connection after 90s was hit.

This line addition makes sense, but we shouldn't make this assumption for all our clients. It's unlikely for the kube-apiserver process to be terminated, so we probably wouldn't run into these context closed errors in k/k too often. However, with the 90s idle time, the earliest we'd be able to detect a problem on a connection is also 90s, which is already on the order of minutes.

Contributor Author

I'd be interested in seeing what happens with webhook calls, which have <10 second http timeouts.

Contributor Author

kubernetes/kubernetes#95981 would reduce that in general to 45s. (Still not great.) I am also curious to see the behavior with webhook calls, as they generally have a sub-10s http timeout.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Feb 4, 2021
@k8s-ci-robot
Contributor

@cheftako: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Jefftree
Member

Jefftree commented Feb 9, 2021

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 9, 2021
@k8s-triage-robot

The lifecycle/frozen label can not be applied to PRs.

This bot removes lifecycle/frozen from PRs because:

  • Commenting /lifecycle frozen on a PR has not worked since March 2021
  • PRs that remain open for >150 days are unlikely to be easily rebased

You can:

  • Rebase this PR and attempt to get it merged
  • Close this PR with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/remove-lifecycle frozen

@k8s-ci-robot k8s-ci-robot removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Aug 18, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 16, 2021
@k8s-ci-robot
Contributor

@cheftako: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                                       | Commit  | Details | Required | Rerun command
pull-apiserver-network-proxy-make-lint          | 67b05e9 | link    |          | /test pull-apiserver-network-proxy-make-lint
pull-apiserver-network-proxy-docker-build-amd64 | 67b05e9 | link    | true     | /test pull-apiserver-network-proxy-docker-build-amd64
pull-apiserver-network-proxy-docker-build-arm64 | 67b05e9 | link    | true     | /test pull-apiserver-network-proxy-docker-build-arm64

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 30, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
