Restart port forwarding on failure #6013
I was having a conversation about this with Tomas on Friday. The way I patched the PR #5933 was by adding one more check to the
diff --git a/pkg/portForward/portForward.go b/pkg/portForward/portForward.go
index 7101fc2fb..fc900d573 100644
--- a/pkg/portForward/portForward.go
+++ b/pkg/portForward/portForward.go
@@ -49,7 +49,7 @@ func (o *PFClient) StartPortForwarding(
 		return err
 	}
-	if o.stopChan != nil && reflect.DeepEqual(ceMapping, o.appliedEndpoints) {
+	if o.stopChan != nil && o.finishedChan == nil && reflect.DeepEqual(ceMapping, o.appliedEndpoints) {
 		return nil
 	}
The problem I was then trying to solve was user doing |
I don't think this change is necessary with the changes I have made on my side. |
Mostly questions where I didn't understand why we are doing what we are doing.
A BIG THANK YOU for adding comments. It made reading the code between the goroutines and StopPortForwarding easier.
pkg/portForward/portForward.go
	o.originalErrorHandlers = runtime.ErrorHandlers
	runtime.ErrorHandlers = append(runtime.ErrorHandlers, func(err error) {
How does this work? At this place, we are adding a check to perform some action if the error is "lost connection to pod".
But why are we doing runtime.ErrorHandlers = o.originalErrorHandlers in the StopPortForwarding method and not here? And what is the significance of this in the StopPortForwarding method?
runtime.ErrorHandlers is a generic list of error handlers defined in the API Machinery library (https://github.com/kubernetes/apimachinery/blob/master/pkg/util/runtime/runtime.go).
It is used by the client-go library, specifically in the forward function, through the call to runtime.HandleError:
// forward dials the remote host specific in req, upgrades the request, starts
// listeners for each port specified in ports, and forwards local connections
// to the remote host via streams.
func (pf *PortForwarder) forward() error {
	[...]
	// wait for interrupt or conn closure
	select {
	case <-pf.stopChan:
	case <-pf.streamConn.CloseChan():
		runtime.HandleError(errors.New("lost connection to pod"))
	}
	return nil
}
By default, ErrorHandlers contains two handlers: one that logs the error using klog, and one that applies a back-off between errors:
var ErrorHandlers = []func(error){
	logError,
	(&rudimentaryErrorBackoff{
		lastErrorTime: time.Now(),
		// 1ms was the number folks were able to stomach as a global rate limit.
		// If you need to log errors more than 1000 times a second you
		// should probably consider fixing your code instead. :)
		minPeriod: time.Millisecond,
	}).OnError,
}
Here, we are adding a new handler to the list, so our handler is called when errors occur, and specifically when the "lost connection to pod" error happens.
We need to reset the handlers to their original value when we stop the port forwarding; otherwise the same handler would be added twice, 3x, 4x, etc. each time we restart the port forwarding after a failure.
	o.stopChan <- struct{}{}
	o.stopChan = make(chan struct{}, 1)
Why are we doing this? I understand that o.stopChan <- struct{}{} stops the port forwarding started using the client-go library. But why are we doing o.stopChan = make(chan struct{}, 1) after that?
I prefer to start with a fresh new channel. The old one will be cleaned by the GC.
OK. But what's the rationale behind it? If it's about unblocking, as you mentioned in another comment, I think we are doing some kind of "trick".
It's about restarting with a fresh new state: we are starting a new port forwarding and don't care about the state of the previous one.
@@ -58,7 +66,6 @@ func (o *PFClient) StartPortForwarding(
	o.StopPortForwarding()

	o.stopChan = make(chan struct{}, 1)
Why is this a buffered channel?
With a non-buffered channel, writing into the channel blocks until someone reads from it.
With a buffered one, the first write into the channel won't block, even if no one reads it.
This way, we can write to the channel and continue, even if the forward function is not yet started and so won't read the channel.
That's also why I'm recreating a fresh new channel afterwards, so if the channel has not been read, we won't be blocked during the second write.
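The blocking behavior described here can be checked with a small standalone program. This is only a sketch: `trySend` is a hypothetical helper that uses `select` with a `default` case to report whether a write would block, whereas the PR itself performs a plain blocking write.

```go
package main

import "fmt"

// trySend reports whether a value could be written to ch without blocking.
func trySend(ch chan struct{}) bool {
	select {
	case ch <- struct{}{}:
		return true
	default:
		return false
	}
}

func main() {
	unbuffered := make(chan struct{})
	buffered := make(chan struct{}, 1)

	fmt.Println(trySend(unbuffered)) // false: no reader is waiting
	fmt.Println(trySend(buffered))   // true: the buffer has one free slot
	fmt.Println(trySend(buffered))   // false: the buffer is now full

	// Replacing the channel gives the next write a free slot again,
	// whether or not the previous value was ever read.
	buffered = make(chan struct{}, 1)
	fmt.Println(trySend(buffered)) // true
}
```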
> This way, we can write on the channel and continue, even if the forward function is not yet started and so won't read the channel.
This makes more sense when I look at the following block, where we first call StopPortForwarding under a certain condition and then call StartPortForwarding:
odo/pkg/devfile/adapters/kubernetes/component/adapter.go
Lines 313 to 317 in 076f2e2
	if podChanged || portsChanged {
		a.portForwardClient.StopPortForwarding()
	}
	err = a.portForwardClient.StartPortForwarding(a.Devfile, a.ComponentName, parameters.RandomPorts, parameters.ErrOut)
Is it for this scenario that we are using a buffered channel here?
> That's also why I'm recreating a fresh new channel after, so if the channel has not been read, we won't be blocked during the second write.
Are you referring to this code?
	o.stopChan <- struct{}{}
	o.stopChan = make(chan struct{}, 1)
But that defeats the purpose of having a channel with buffer size 1, no? If we don't want to be blocked, we should create a larger buffer.
But of which size? If you create it with size n, it will happen once in a while that it gets written n+1 times.
In particular, I tested the changes by flooding the app with the following command, while modifying the source code so that the pod restarts:
while : ; do curl localhost:40001; done
With a buffered channel of size 100, I'm pretty sure I would have filled the buffer.
For channel sizing, see this recommendation (size of 0 or 1) from the coding conventions doc: https://github.com/redhat-developer/odo/wiki/Dev:-Coding-Conventions#channel-size-is-one-or-none
Co-authored-by: Dharmit Shah <shahdharmit@gmail.com>
/approve |
/lgtm |
@valaparthvi WDYT? |
Restart tests to check flakiness of fixed test |
Added a few comments.
You already mentioned that, but the more I used the --random-ports flag, the more I found it confusing that users get assigned a new random port when the port forwarding is restarted (in the same Dev session). If we could store the previous ports in memory and try to reuse them first, I think this would improve the user experience. But this can be addressed in a separate issue.
LGTM overall otherwise.
I was also inclined to do this, but this would break the integration tests, as we are not guaranteed that the port is not taken by another process during the restart of the port forwarding. |
I see. I think we can find a trade-off by letting users decide (via yet another flag, for example) which behavior to adopt with respect to random ports. The default behavior could be to try to reuse the ports, while in integration tests it would be acceptable to have new ports assigned. |
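The "try the previous port first, fall back to a new one" idea discussed in this thread could look roughly like the following. This is a sketch under assumptions, not odo's implementation: `listenPreferring` and `portOf` are hypothetical helpers, and the example binds local TCP listeners directly with the standard `net` package rather than going through the client-go port forwarder.

```go
package main

import (
	"fmt"
	"net"
)

// listenPreferring tries to bind the previously used local port first and
// falls back to a free port chosen by the OS if that port is unavailable.
func listenPreferring(previousPort int) (net.Listener, error) {
	if previousPort > 0 {
		l, err := net.Listen("tcp", fmt.Sprintf("127.0.0.1:%d", previousPort))
		if err == nil {
			return l, nil
		}
		// The port was taken in the meantime: fall through to a random port.
	}
	return net.Listen("tcp", "127.0.0.1:0")
}

// portOf returns the local TCP port a listener is bound to.
func portOf(l net.Listener) int {
	return l.Addr().(*net.TCPAddr).Port
}

func main() {
	// First start: no previous port, so the OS picks one.
	first, err := listenPreferring(0)
	if err != nil {
		panic(err)
	}
	defer first.Close()
	port := portOf(first)

	// Restart while another listener still holds the port: a new port is assigned.
	second, err := listenPreferring(port)
	if err != nil {
		panic(err)
	}
	defer second.Close()

	fmt.Println("previous:", port, "assigned after fallback:", portOf(second))
}
```

The fallback branch is what makes the reuse attempt safe when another process grabs the port during the restart, which is the concern raised for the integration tests.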
One last comment: do you think unit-testing the methods in portForward.go would be doable? I understand it might not be that easy, especially with the different goroutines we have and the global runtime.ErrorHandlers variable we use.
I would prefer to wait and see if people are using this |
That's an interesting challenge. I would like to have a try. Either in this PR or another one |
As there was already no unit tests when |
Thanks for your work on this.
/override ci/prow/v4.10-integration-e2e |
/override windows-integration-test/Windows-test |
I would rather not expose that |
Indeed. If possible to hide it (mostly from the user-facing Help output), it would make more sense not to expose it 👍🏿 |
This reverts commit 53a7c3c.
* Restart port forwarding on failure
* Save ports again when port forward is restarted
* Integration test
* Update pkg/portForward/portForward.go
Co-authored-by: Dharmit Shah <shahdharmit@gmail.com>
* Fix rebase
* Fix integration test with run composite command
* Copy errorhandlers
* Add timeout for first-time port forwarding
Co-authored-by: Dharmit Shah <shahdharmit@gmail.com>
What type of PR is this:
/kind bug
What does this PR do / why we need it:
Which issue(s) this PR fixes:
Fixes #5877
PR acceptance criteria:
Unit test
Integration test
Documentation
How to test changes / Special notes to the reviewer: