Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restart port forwarding on failure #6013

Conversation

feloy
Copy link
Contributor

@feloy feloy commented Aug 8, 2022

What type of PR is this:

/kind bug

What does this PR do / why we need it:

Which issue(s) this PR fixes:

Fixes #5877

PR acceptance criteria:

  • Unit test

  • Integration test

  • Documentation

How to test changes / Special notes to the reviewer:

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. Required by Prow. kind/bug Categorizes issue or PR as related to a bug. labels Aug 8, 2022
@netlify
Copy link

netlify bot commented Aug 8, 2022

Deploy Preview for odo-docusaurus-preview canceled.

Name Link
🔨 Latest commit 639bf34
🔍 Latest deploy log https://app.netlify.com/sites/odo-docusaurus-preview/deploys/62f49eea835db20008038c1c

@openshift-ci openshift-ci bot requested review from cdrage and dharmit August 8, 2022 14:13
@odo-robot
Copy link

odo-robot bot commented Aug 8, 2022

Unit Tests on commit 5048cd7 finished successfully.
View logs: TXT HTML

@odo-robot
Copy link

odo-robot bot commented Aug 8, 2022

Validate Tests on commit 5048cd7 finished successfully.
View logs: TXT HTML

@odo-robot
Copy link

odo-robot bot commented Aug 8, 2022

Windows Tests (OCP) on commit 5048cd7 finished with errors.
View logs: TXT HTML

@odo-robot
Copy link

odo-robot bot commented Aug 8, 2022

Kubernetes Tests on commit 5048cd7 finished successfully.
View logs: TXT HTML

@odo-robot
Copy link

odo-robot bot commented Aug 8, 2022

OpenShift Tests on commit 5048cd7 finished successfully.
View logs: TXT HTML

@dharmit
Copy link
Member

dharmit commented Aug 9, 2022

I was having a conversation about this with Tomas on Friday. The way I patched the PR #5933 was by adding one more check to the if condition:

diff --git a/pkg/portForward/portForward.go b/pkg/portForward/portForward.go
index 7101fc2fb..fc900d573 100644
--- a/pkg/portForward/portForward.go
+++ b/pkg/portForward/portForward.go
@@ -49,7 +49,7 @@ func (o *PFClient) StartPortForwarding(
                return err
        }
 
-       if o.stopChan != nil && reflect.DeepEqual(ceMapping, o.appliedEndpoints) {
+       if o.stopChan != nil && o.finishedChan == nil && reflect.DeepEqual(ceMapping, o.appliedEndpoints) {
                return nil
        }

The problem I was then trying to solve was user doing curl while odo dev is syncing a change made by the user to any of the source files. Not the first odo dev, but a subsequent run of it. I figured that if the user curl's while odo dev is still syncing, the goroutine that's handling port-forwarding is exited. And since the stopChan is not nil and endpoints haven't changed, the if condition would result in return'ing before it could start port-forwarding again.

@feloy feloy force-pushed the bugfix-5877/restart-port-forward-on-failure branch from 1ea5d14 to a8815a9 Compare August 9, 2022 14:27
@feloy feloy changed the title [wip] Restart port forwarding on failure Restart port forwarding on failure Aug 9, 2022
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. Required by Prow. label Aug 9, 2022
@feloy feloy requested review from rm3l and removed request for cdrage August 9, 2022 15:29
@feloy
Copy link
Contributor Author

feloy commented Aug 9, 2022

I was having a conversation about this with Tomas on Friday. The way I patched the PR #5933 was by adding one more check to the if condition:

diff --git a/pkg/portForward/portForward.go b/pkg/portForward/portForward.go
index 7101fc2fb..fc900d573 100644
--- a/pkg/portForward/portForward.go
+++ b/pkg/portForward/portForward.go
@@ -49,7 +49,7 @@ func (o *PFClient) StartPortForwarding(
                return err
        }
 
-       if o.stopChan != nil && reflect.DeepEqual(ceMapping, o.appliedEndpoints) {
+       if o.stopChan != nil && o.finishedChan == nil && reflect.DeepEqual(ceMapping, o.appliedEndpoints) {
                return nil
        }

The problem I was then trying to solve was user doing curl while odo dev is syncing a change made by the user to any of the source files. Not the first odo dev, but a subsequent run of it. I figured that if the user curl's while odo dev is still syncing, the goroutine that's handling port-forwarding is exited. And since the stopChan is not nil and endpoints haven't changed, the if condition would result in return'ing before it could start port-forwarding again.

I don't think this change is necessary with the changes I have made on my side.

@feloy feloy closed this Aug 9, 2022
@feloy feloy reopened this Aug 9, 2022
@feloy feloy force-pushed the bugfix-5877/restart-port-forward-on-failure branch from a8815a9 to 93b1a83 Compare August 9, 2022 15:49
Copy link
Member

@dharmit dharmit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly questions where I didn't understand why we are doing what we are doing.

A BIG THANK YOU for adding comments. It made reading code between the goroutines and StopPortForwarding easier.

Comment on lines 85 to 86
o.originalErrorHandlers = runtime.ErrorHandlers
runtime.ErrorHandlers = append(runtime.ErrorHandlers, func(err error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this work? At this place, we are adding a check to perform some action if the error is "lost connection to pod".
But why are we doing runtime.ErrorHandlers = o.originalErrorHandlers in the StopForwarding method and not here? And what is the significance of this in StopForwarding method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

runtime.ErrorHandlers is a generic error handler defined in the API Machinery library (https://github.com/kubernetes/apimachinery/blob/master/pkg/util/runtime/runtime.go).

It is used by the client-go library, specifically in the forward function, through the call to runtime.HandleError:

// forward dials the remote host specific in req, upgrades the request, starts
// listeners for each port specified in ports, and forwards local connections
// to the remote host via streams.
func (pf *PortForwarder) forward() error {
[...]
	// wait for interrupt or conn closure
	select {
	case <-pf.stopChan:
	case <-pf.streamConn.CloseChan():
		runtime.HandleError(errors.New("lost connection to pod"))
	}

	return nil
}

By default, the ErrorHandlers contains two handlers: 1 for logging the error using klog, and 1 for backing off the errors:

var ErrorHandlers = []func(error){
	logError,
	(&rudimentaryErrorBackoff{
		lastErrorTime: time.Now(),
		// 1ms was the number folks were able to stomach as a global rate limit.
		// If you need to log errors more than 1000 times a second you
		// should probably consider fixing your code instead. :)
		minPeriod: time.Millisecond,
	}).OnError,
}

Here, we are adding a new handler to the list, so our handler can be called when errors occur, and specifically when the "list connection" error happens.

We need to reset the handlers to their original value when we are stopping the port forwarding, or the same handler will be added twice, 3x, 4x, etc when we are restartting the port forward again after a failure.

Comment on lines +93 to +94
o.stopChan <- struct{}{}
o.stopChan = make(chan struct{}, 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we doing this? I understand the part about o.stopChan <- struct{}{} is to stop the port-forwarding started using client-go library. But why are we doing o.stopChan = make(chan struct{}, 1) after that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to start with a fresh new channel. The old one will be cleaned by the GC.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. But what's the rationale behind it? If it's about unblocking, as you mentioned in another comment, I think we are doing some kind of "trick".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's about restarting with fresh new state, because we are restarting a new port forwarding, and don't care about the state of the previous one.

@@ -58,7 +66,6 @@ func (o *PFClient) StartPortForwarding(
o.StopPortForwarding()

o.stopChan = make(chan struct{}, 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a buffered channel?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a non-buffered one, writing into this channel will block until the someone reads the channel.

Usin a buffered one, the first write into the channel won't block, even if no one reads it.

This way, we can write on the channel and continue, even if the forward function is not yet started and so wion't read the channel.

That's also why I'm recreating a fresh new channel after, so if the channel has not been read, we won't be blocked during the second write.

Copy link
Member

@dharmit dharmit Aug 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This way, we can write on the channel and continue, even if the forward function is not yet started and so wion't read the channel.

This makes more sense when I look at the following block, where we are first calling StopForwarding under a certain situation and then calling StartPortForwarding:

if podChanged || portsChanged {
a.portForwardClient.StopPortForwarding()
}
err = a.portForwardClient.StartPortForwarding(a.Devfile, a.ComponentName, parameters.RandomPorts, parameters.ErrOut)

Is it for this scenario that we are using buffered channel here?

That's also why I'm recreating a fresh new channel after, so if the channel has not been read, we won't be blocked during the second write.

Are you referring to the code:

o.stopChan <- struct{}{}
o.stopChan = make(chan struct{}, 1)

But that defies the purpose of having a channel with buffer size 1, no? If we don't want to be blocked, we should create a larger buffer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But of which size? If you create it of size n, it will happen once in a while that it will be written n+1 times.

Particularly, I tested the changes by flooding the app with the following command, while modifying source code so the pod restarts:

while : ; do curl localhost:40001; done

With a buffered channel of size 100, I'm pretty sure I would have filled the buffer

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For channels sizing, see this recommendation (size of 0 or 1) from the coding conventions doc: https://github.com/redhat-developer/odo/wiki/Dev:-Coding-Conventions#channel-size-is-one-or-none

@valaparthvi
Copy link
Contributor

/approve

@openshift-ci
Copy link

openshift-ci bot commented Aug 10, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: valaparthvi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. Required by Prow. label Aug 10, 2022
@dharmit
Copy link
Member

dharmit commented Aug 10, 2022

/lgtm

@feloy
Copy link
Contributor Author

feloy commented Aug 10, 2022

Openshift IBM Cloud tests:

odo dev command tests
/go/odo_1/tests/integration/cmd_dev_test.go:29
  devfile contains composite apply command
  /go/odo_1/tests/integration/cmd_dev_test.go:1197
    when odo dev is running
    /go/odo_1/tests/integration/cmd_dev_test.go:1209
      [It] should execute the composite apply commands successfully
      /go/odo_1/tests/integration/cmd_dev_test.go:1219

    [odo] Error: Cannot find module 'prom-client'
    [odo] Require stack:
    [odo] - /projects/server.js
    [odo]     at Function.Module._resolveFilename (internal/modules/cjs/loader.js:902:15)
    [odo]     at Function.Module._load (internal/modules/cjs/loader.js:746:27)
    [odo]     at Module.require (internal/modules/cjs/loader.js:974:19)
    [odo]     at require (internal/modules/cjs/helpers.js:93:18)
    [odo]     at Object.<anonymous> (/projects/server.js:1:20)
    [odo]     at Module._compile (internal/modules/cjs/loader.js:1085:14)
    [odo]     at Object.Module._extensions..js (internal/modules/cjs/loader.js:1114:10)
    [odo]     at Module.load (internal/modules/cjs/loader.js:950:32)
    [odo]     at Function.Module._load (internal/modules/cjs/loader.js:790:12)
    [odo]     at Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:76:12) {
    [odo]   code: 'MODULE_NOT_FOUND',
    [odo]   requireStack: [ '/projects/server.js' ]
    [odo] }

@valaparthvi
I'm seeing this error regularly in Openshift IBM Cloud tests (in this PR but also other PRs). I've made this commit to try and fix it: 4d23968

WDYT?

@feloy
Copy link
Contributor Author

feloy commented Aug 10, 2022

Restart tests to check flakiness of fixed test

@feloy feloy closed this Aug 10, 2022
@feloy feloy reopened this Aug 10, 2022
Copy link
Member

@rm3l rm3l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few comments.

You already mentioned that, but the more I used the --random-ports flag, the more I found it confusing that users get assigned a new random port when the port forwarding is restarted (in the same Dev session). If we could store the previous ports in memory and try to reuse them first, I think this would improve the user experience. But this can be addressed in a separate issue.

LGTM overall otherwise.

@feloy
Copy link
Contributor Author

feloy commented Aug 11, 2022

You already mentioned that, but the more I used the --random-ports flag, the more I found it confusing that users get assigned a new random port when the port forwarding is restarted (in the same Dev session). If we could store the previous ports in memory and try to reuse them first, I think this would improve the user experience. But this can be addressed in a separate issue.

I was also inclined to do this, but this would break the integration tests, as we are not guaranteed that the port is not taken by another process during the restart of the port forwarding.

@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 1 Code Smell

No Coverage information No Coverage information
0.0% 0.0% Duplication

@rm3l
Copy link
Member

rm3l commented Aug 11, 2022

You already mentioned that, but the more I used the --random-ports flag, the more I found it confusing that users get assigned a new random port when the port forwarding is restarted (in the same Dev session). If we could store the previous ports in memory and try to reuse them first, I think this would improve the user experience. But this can be addressed in a separate issue.

I was also inclined to do this, but this would break the integration tests, as we are not guaranteed that the port is not taken by another process during the restart of the port forwarding.

I see. I think we can find a trade-off by letting users decide (via a yet-another flag for example) which behavior to adopt w.r.t random ports. Default behavior could be to try to reuse the ports, but in integration tests, it would be acceptable to have new ports assigned.

Copy link
Member

@rm3l rm3l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last comment - do you think unit-testing the methods in portForward.go could be do-able? I understand it might not be that easy to do, especially with the different goroutines we have and the global runtime.ErrorHandlers variable we use..

@feloy
Copy link
Contributor Author

feloy commented Aug 11, 2022

I see. I think we can find a trade-off by letting users decide (via a yet-another flag for example) which behavior to adopt w.r.t random ports. Default behavior could be to try to reuse the ports, but in integration tests, it would be acceptable to have new ports assigned.

I would prefer to wait and see if people are using this random-ports flag. It has been made for integration tests, I'm not sure there is a real use case for developers. And I would not like to add complexity to this part of the code which is already quite imbricated, if not necessary.

@feloy
Copy link
Contributor Author

feloy commented Aug 11, 2022

One last comment - do you think unit-testing the methods in portForward.go could be do-able? I understand it might not be that easy to do, especially with the different goroutines we have and the global runtime.ErrorHandlers variable we use..

That's an interesting challenge. I would like to have a try. Either in this PR or another one

@rm3l
Copy link
Member

rm3l commented Aug 11, 2022

One last comment - do you think unit-testing the methods in portForward.go could be do-able? I understand it might not be that easy to do, especially with the different goroutines we have and the global runtime.ErrorHandlers variable we use..

That's an interesting challenge. I would like to have a try. Either in this PR or another one

As there was already no unit tests when portForward.go was introduced the first time, it is okay to me if unit tests are done in a separate PR (we should focus on the upcoming release candidate).

Copy link
Member

@rm3l rm3l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work on this.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. Required by Prow. label Aug 11, 2022
@feloy
Copy link
Contributor Author

feloy commented Aug 11, 2022

/override ci/prow/v4.10-integration-e2e

@feloy
Copy link
Contributor Author

feloy commented Aug 11, 2022

/override windows-integration-test/Windows-test

@openshift-ci
Copy link

openshift-ci bot commented Aug 11, 2022

@feloy: Overrode contexts on behalf of feloy: windows-integration-test/Windows-test

In response to this:

/override windows-integration-test/Windows-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci
Copy link

openshift-ci bot commented Aug 11, 2022

@feloy: Overrode contexts on behalf of feloy: ci/prow/v4.10-integration-e2e, windows-integration-test/Windows-test

In response to this:

/override ci/prow/v4.10-integration-e2e
/override windows-integration-test/Windows-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci
Copy link

openshift-ci bot commented Aug 11, 2022

@feloy: Overrode contexts on behalf of feloy: ci/prow/v4.10-integration-e2e

In response to this:

/override ci/prow/v4.10-integration-e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-robot openshift-merge-robot merged commit 53a7c3c into redhat-developer:main Aug 11, 2022
@dharmit
Copy link
Member

dharmit commented Aug 11, 2022

I see. I think we can find a trade-off by letting users decide (via a yet-another flag for example) which behavior to adopt w.r.t random ports. Default behavior could be to try to reuse the ports, but in integration tests, it would be acceptable to have new ports assigned.

I would prefer to wait and see if people are using this random-ports flag. It has been made for integration tests, I'm not sure there is a real use case for developers. And I would not like to add complexity to this part of the code which is already quite imbricated, if not necessary.

I would rather not expose that port flag to the users at all. Opinionated. 😉

@rm3l
Copy link
Member

rm3l commented Aug 11, 2022

I see. I think we can find a trade-off by letting users decide (via a yet-another flag for example) which behavior to adopt w.r.t random ports. Default behavior could be to try to reuse the ports, but in integration tests, it would be acceptable to have new ports assigned.

I would prefer to wait and see if people are using this random-ports flag. It has been made for integration tests, I'm not sure there is a real use case for developers. And I would not like to add complexity to this part of the code which is already quite imbricated, if not necessary.

I would rather not expose that port flag to the users at all. Opinionated. wink

Indeed. If possible to hide it (mostly from the user-facing Help output), it would make more sense not to expose it 👍🏿

feloy added a commit to feloy/odo that referenced this pull request Aug 18, 2022
cdrage pushed a commit to cdrage/odo that referenced this pull request Aug 31, 2022
* Restart port forwarding on failure

* Save ports again when port forward is restarted

* Integration test

* Update pkg/portForward/portForward.go

Co-authored-by: Dharmit Shah <shahdharmit@gmail.com>

* Fix rebase

* Fix integration test with run composite command

* Copy errorhandlers

* Add timeout for first-time port forwarding

Co-authored-by: Dharmit Shah <shahdharmit@gmail.com>
@rm3l rm3l mentioned this pull request Jan 26, 2023
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. Required by Prow. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. Required by Prow.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

odo dev port forward timeouts if app is not ready
5 participants