Added code to track tainting of a tunnel. #155
Taint lasts for the TCP keepalive period. The taint code only works if the client remembers to properly close its connection. Tested using ifconfig down to break the connection and prevent a TCP close from the OS (kill -9 and similar do not work). This does NOT close tunnels; it instead relies on the already implemented TCP keepalive for that functionality.
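To make the mechanism concrete, here is a minimal, hypothetical sketch of the idea in Go (none of these names are the PR's actual types; the real change lives in the BackendManager/BackendStorage code discussed below): an agent that reports an error is marked tainted for roughly one keepalive period, the picker prefers untainted agents, and if every agent is tainted the proxy simply falls back to today's behavior.

```go
package backendtaint

import (
	"sync"
	"time"
)

// taintTracker is a hypothetical illustration of the tainting idea:
// an agent is considered tainted for one keepalive period after an error.
type taintTracker struct {
	mu           sync.Mutex
	taintedUntil map[string]time.Time // agentID -> when the taint expires
	keepalive    time.Duration        // assumed to match the TCP keepalive period
}

func newTaintTracker(keepalive time.Duration) *taintTracker {
	return &taintTracker{
		taintedUntil: make(map[string]time.Time),
		keepalive:    keepalive,
	}
}

// Taint records that agentID recently failed; it does NOT close the tunnel.
func (t *taintTracker) Taint(agentID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.taintedUntil[agentID] = time.Now().Add(t.keepalive)
}

// IsTainted reports whether agentID failed within the last keepalive period.
func (t *taintTracker) IsTainted(agentID string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	until, ok := t.taintedUntil[agentID]
	if !ok {
		return false
	}
	if time.Now().After(until) {
		delete(t.taintedUntil, agentID) // the taint has expired
		return false
	}
	return true
}

// pickAgent prefers untainted agents; if every agent is tainted it falls back
// to the first one, which is equivalent to the pre-taint behavior.
func pickAgent(t *taintTracker, agentIDs []string) (string, bool) {
	for _, id := range agentIDs {
		if !t.IsTainted(id) {
			return id, true
		}
	}
	if len(agentIDs) > 0 {
		return agentIDs[0], true
	}
	return "", false
}
```

The key property, as described above, is that tainting never closes a tunnel; taint expiry plus the existing TCP keepalive decide whether the connection is actually dead.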
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: cheftako. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment.
/hold (a) to add a CLI flag to enable/disable the behavior, and (b) to let #144 merge first. (I should take care of merge conflicts.)
/assign @caesarxuchao
Adding locks which were documented but missing.
Walter, can you remind me what problem this tainting mechanism is trying to solve? IIUC, it tries to avoid picking a backend if it has recently failed. If so, the benefit is saving end users of the network proxy from waiting for the server to detect a GRPC connection to the backend is broken, which could take up to the keepAlive seconds.
The tainting mechanism is quite complex. I don't know if the benefit is worth the complexity.
}

// BackendStorage is an interface to manage the storage of the backend
// connections, i.e., get, add and remove
type BackendStorage interface {
	// AddBackend adds a backend.
	AddBackend(agentID string, conn agent.AgentService_ConnectServer) Backend
	// TaintBackend indicates an error occurred on a backend and allows to BackendManager to act based on the error.
"allows the* BackendManager..."?
@@ -76,7 +87,7 @@ type BackendManager interface {
 	// context instead of a request-scoped context, as the backend manager will
 	// pick a backend for every tunnel session and each tunnel session may
 	// contains multiple requests.
-	Backend(ctx context.Context) (Backend, error)
+	Backend(ctx context.Context) (Backend, string, error)
Can you update the comment saying what the returned string is?
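Taking this comment together with the earlier wording nit on TaintBackend, the updated interface documentation could read roughly like the sketch below. The string return value is presumably the agentID of the chosen backend (that is what the server later passes to TaintBackend), but that is my reading, not the PR's confirmed wording; the stand-in types exist only so the sketch compiles on its own.

```go
package backendsketch

import "context"

// Backend stands in for the proxy's real backend connection type in this
// sketch; the actual definition lives in the proxy server package.
type Backend interface{}

// BackendManager is a trimmed, hypothetical rendering of the interface under
// review, showing one way to document the new string return value.
type BackendManager interface {
	// Backend picks a backend for a tunnel session and returns it together
	// with the agentID of the chosen backend, so that callers can attribute
	// later errors to that agent (for example via TaintBackend).
	Backend(ctx context.Context) (Backend, string, error)
	// TaintBackend indicates an error occurred on a backend and allows the
	// BackendManager to act based on the error.
	TaintBackend(agentID string, err error)
}
```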
// in the Backend() method. There is no reliable way to randomly
// pick a key from a map (in this case, the backends) in Golang.
// Consider switching this to a map[string]bool to get set behavior
// A little nasty as length where true is yuck.
Sorry, I can't follow the last two lines of the comment; can you explain more?
Golang has no set type, but having one would make things a little easier (it would prevent issues with re-adding existing values and throwing off the count). The standard workaround is to use a map[string]bool to get that functionality. We'd also want a set-length function, which would be equivalent to counting the number of keys where the bool is true. Very doable, but I decided it was a nice-to-have for now.
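For anyone following along, this is the workaround being described, as a standalone sketch (nothing here is the PR's actual code): a map[string]bool standing in for a set, where the "set length" is the count of keys whose value is true.

```go
package setsketch

// addAgent inserts agentID into the "set"; re-adding an existing ID does not
// change membership, so counts are not thrown off.
func addAgent(set map[string]bool, agentID string) {
	set[agentID] = true
}

// removeAgent marks agentID as absent. Deleting the key also works; leaving a
// false value behind is exactly the "length where true is yuck" wrinkle.
func removeAgent(set map[string]bool, agentID string) {
	set[agentID] = false
}

// setLen counts keys whose value is true, i.e. the number of members.
func setLen(set map[string]bool) int {
	n := 0
	for _, present := range set {
		if present {
			n++
		}
	}
	return n
}
```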
if frontStream.getStatus() != Closing {
	klog.ErrorS(err, "Stream read from frontend failure",
		"agentID", agentID)
	s.BackendManager.TaintBackend(agentID, err)
Why do we taint the backend when the frontend stream read is broken?
Sadly, if we lose the backend to something like a machine crash or network outage, we get no error on the backend. Essentially we send TCP packets which disappear into the ether. The TCP timeout, or even keepalive, is generally measured in minutes, while the frontend request timeout is in seconds. So the error shows up as a frontend error (really a timeout, but to us it looks like a context closed error). We can't be sure what is wrong, so it's better to "taint" the connection than close it. We need to rely on the keepalive options others have added to take care of that. However, keepalive can be slow. This helps us not send requests to a dead connection while we are determining whether the connection is actually dead.
Does this mean hitting any client-side timeout results in a backend taint? I've encountered timeouts on the k/k client side, and sometimes it's just the destination service being overloaded and taking a while to respond. This happens a decent amount in the k/k e2e tests, and the retry mechanism for most clients takes care of the problem.
I'm worried a problematic destination (e.g. a webhook) could cause all backends to be tainted for a short period of time.
All backends getting tainted just reverts us to the current behavior. We lose the benefits of tainting but the system continues to work.
We can detect a broken agent by observing whether a dial response is returned by the agent on time. That's less prone to false positives.
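A rough sketch of that alternative, under my own assumptions about the surrounding code (the channel and timeout are invented for illustration): the server waits for the agent's dial response with a deadline and only treats the agent as suspect when the deadline passes.

```go
package dialsketch

import (
	"errors"
	"time"
)

// errDialTimeout indicates the agent did not answer a dial request in time.
var errDialTimeout = errors.New("timed out waiting for dial response from agent")

// waitForDialResponse waits for a dial response on respCh (assumed to be
// signalled by the goroutine receiving from the agent stream) and reports a
// timeout if nothing arrives before the deadline. A timeout here is a much
// stronger signal that the agent is broken than a frontend-side context
// cancellation, so it is less prone to false positives.
func waitForDialResponse(respCh <-chan struct{}, timeout time.Duration) error {
	select {
	case <-respCh:
		return nil
	case <-time.After(timeout):
		return errDialTimeout
	}
}
```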
@@ -250,6 +250,7 @@ func (c *Client) run(o *GrpcProxyClientOptions) error {
 			time.Sleep(wait)
 		}
 	}
+	client.CloseIdleConnections()
What happens if client idle connections are not closed?
In my tests it causes us to incorrectly taint tunnels. We then systematically work our way through the list of tunnels/backends until they are all tainted. At that point the code works the way it does today. So it's not worse than today, but it does mean you're not getting any benefit out of the tunnel/backend tainting code.
The default IdleConnTimeout for golang's http library is 90 seconds, and almost all clients in k/k keep that parameter untouched.
I'm surprised that we get incorrectly tainted tunnels rather than just a really delayed CLOSE_REQ.
90 seconds is a really long time. With our test client (and no CloseIdleConnections call) I was seeing a context closed error on the front end connection after <10 seconds. That may just be the OS cleaning up the connection once the process went away.
That may just be the OS cleaning up the connection once the process went away.
Ahh I think that's exactly it. I remember adding a sleep interval at the end of the client previously and it gracefully closed the connection after 90s was hit.
This line addition makes sense, but we shouldn't make this assumption for all our clients. It's unlikely for the kube-apiserver process to be terminated, so we probably wouldn't run into these context closed errors in k/k too often. However, with the 90s idle time, the earliest we'd be able to detect a problem on a connection is also 90s, which is already on the order of minutes.
I'd be interested in seeing what happens with webhook calls, which have <10 second http timeouts.
kubernetes/kubernetes#95981 would reduce that in general to 45s. (Still not great). I am also curious to see the behavior with webhook calls as they generally had a sub 10s http timeout.
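For reference on the discussion above, here is roughly what a test client can do with plain net/http (generic usage, not the repo's actual client code): shorten IdleConnTimeout from its 90s default and close idle connections explicitly before exiting, so the proxy sees a clean TCP close instead of an abandoned connection.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Shorten the idle timeout from net/http's 90s default so broken idle
	// connections are dropped sooner.
	transport := &http.Transport{
		IdleConnTimeout: 30 * time.Second,
	}
	client := &http.Client{
		Transport: transport,
		Timeout:   10 * time.Second, // request timeout, similar to a webhook's sub-10s budget
	}

	resp, err := client.Get("http://example.com/") // placeholder URL for the sketch
	if err != nil {
		fmt.Println("request failed:", err)
	} else {
		resp.Body.Close()
	}

	// Explicitly close idle connections before the process exits, mirroring
	// the client.CloseIdleConnections() line added in this PR.
	client.CloseIdleConnections()
}
```

This does not change how long-lived clients such as kube-apiserver behave; it only illustrates why the short-lived test client needed the explicit close.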
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@cheftako: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/lifecycle frozen
This bot removes the lifecycle/frozen label from PRs. Please send feedback to sig-contributor-experience at kubernetes/community. /remove-lifecycle frozen
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
@cheftako: The following tests failed; say /retest to rerun them.
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /close
@k8s-triage-robot: Closed this PR in response to the /close above.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.