Restart port forwarding on failure #6013

feloy · 2022-08-08T14:12:56Z

What type of PR is this:

/kind bug

What does this PR do / why we need it:

Which issue(s) this PR fixes:

Fixes #5877

PR acceptance criteria:

Unit test
Integration test
Documentation

How to test changes / Special notes to the reviewer:

netlify · 2022-08-08T14:13:11Z

✅ Deploy Preview for odo-docusaurus-preview canceled.

Name	Link
🔨 Latest commit	`639bf34`
🔍 Latest deploy log	https://app.netlify.com/sites/odo-docusaurus-preview/deploys/62f49eea835db20008038c1c

odo-robot · 2022-08-08T14:20:17Z

Unit Tests on commit 5048cd7 finished successfully.
View logs: TXT HTML

odo-robot · 2022-08-08T14:28:10Z

Validate Tests on commit 5048cd7 finished successfully.
View logs: TXT HTML

odo-robot · 2022-08-08T14:29:35Z

Windows Tests (OCP) on commit 5048cd7 finished with errors.
View logs: TXT HTML

odo-robot · 2022-08-08T14:30:47Z

Kubernetes Tests on commit 5048cd7 finished successfully.
View logs: TXT HTML

odo-robot · 2022-08-08T14:32:09Z

OpenShift Tests on commit 5048cd7 finished successfully.
View logs: TXT HTML

dharmit · 2022-08-09T07:07:06Z

I was having a conversation about this with Tomas on Friday. The way I patched the PR #5933 was by adding one more check to the if condition:

diff --git a/pkg/portForward/portForward.go b/pkg/portForward/portForward.go
index 7101fc2fb..fc900d573 100644
--- a/pkg/portForward/portForward.go
+++ b/pkg/portForward/portForward.go
@@ -49,7 +49,7 @@ func (o *PFClient) StartPortForwarding(
                return err
        }
 
-       if o.stopChan != nil && reflect.DeepEqual(ceMapping, o.appliedEndpoints) {
+       if o.stopChan != nil && o.finishedChan == nil && reflect.DeepEqual(ceMapping, o.appliedEndpoints) {
                return nil
        }

The problem I was then trying to solve was user doing curl while odo dev is syncing a change made by the user to any of the source files. Not the first odo dev, but a subsequent run of it. I figured that if the user curl's while odo dev is still syncing, the goroutine that's handling port-forwarding is exited. And since the stopChan is not nil and endpoints haven't changed, the if condition would result in return'ing before it could start port-forwarding again.

feloy · 2022-08-09T15:29:59Z

I was having a conversation about this with Tomas on Friday. The way I patched the PR #5933 was by adding one more check to the if condition:
diff --git a/pkg/portForward/portForward.go b/pkg/portForward/portForward.go
index 7101fc2fb..fc900d573 100644
--- a/pkg/portForward/portForward.go
+++ b/pkg/portForward/portForward.go
@@ -49,7 +49,7 @@ func (o *PFClient) StartPortForwarding(
                return err
        }
 
-       if o.stopChan != nil && reflect.DeepEqual(ceMapping, o.appliedEndpoints) {
+       if o.stopChan != nil && o.finishedChan == nil && reflect.DeepEqual(ceMapping, o.appliedEndpoints) {
                return nil
        }
The problem I was then trying to solve was user doing curl while odo dev is syncing a change made by the user to any of the source files. Not the first odo dev, but a subsequent run of it. I figured that if the user curl's while odo dev is still syncing, the goroutine that's handling port-forwarding is exited. And since the stopChan is not nil and endpoints haven't changed, the if condition would result in return'ing before it could start port-forwarding again.

I don't think this change is necessary with the changes I have made on my side.

dharmit

Mostly questions where I didn't understand why we are doing what we are doing.

A BIG THANK YOU for adding comments. It made reading code between the goroutines and StopPortForwarding easier.

pkg/portForward/portForward.go

dharmit · 2022-08-10T06:57:21Z

pkg/portForward/portForward.go

+	o.originalErrorHandlers = runtime.ErrorHandlers
+	runtime.ErrorHandlers = append(runtime.ErrorHandlers, func(err error) {


How does this work? At this place, we are adding a check to perform some action if the error is "lost connection to pod".
But why are we doing runtime.ErrorHandlers = o.originalErrorHandlers in the StopForwarding method and not here? And what is the significance of this in StopForwarding method?

runtime.ErrorHandlers is a generic error handler defined in the API Machinery library (https://github.com/kubernetes/apimachinery/blob/master/pkg/util/runtime/runtime.go).

It is used by the client-go library, specifically in the forward function, through the call to runtime.HandleError:

// forward dials the remote host specific in req, upgrades the request, starts // listeners for each port specified in ports, and forwards local connections // to the remote host via streams. func (pf *PortForwarder) forward() error { [...] // wait for interrupt or conn closure select { case <-pf.stopChan: case <-pf.streamConn.CloseChan(): runtime.HandleError(errors.New("lost connection to pod")) } return nil }

By default, the ErrorHandlers contains two handlers: 1 for logging the error using klog, and 1 for backing off the errors:

var ErrorHandlers = []func(error){ logError, (&rudimentaryErrorBackoff{ lastErrorTime: time.Now(), // 1ms was the number folks were able to stomach as a global rate limit. // If you need to log errors more than 1000 times a second you // should probably consider fixing your code instead. :) minPeriod: time.Millisecond, }).OnError, }

Here, we are adding a new handler to the list, so our handler can be called when errors occur, and specifically when the "list connection" error happens.

We need to reset the handlers to their original value when we are stopping the port forwarding, or the same handler will be added twice, 3x, 4x, etc when we are restartting the port forward again after a failure.

dharmit · 2022-08-10T07:03:53Z

pkg/portForward/portForward.go

+			o.stopChan <- struct{}{}
+			o.stopChan = make(chan struct{}, 1)


Why are we doing this? I understand the part about o.stopChan <- struct{}{} is to stop the port-forwarding started using client-go library. But why are we doing o.stopChan = make(chan struct{}, 1) after that?

I prefer to start with a fresh new channel. The old one will be cleaned by the GC.

OK. But what's the rationale behind it? If it's about unblocking, as you mentioned in another comment, I think we are doing some kind of "trick".

It's about restarting with fresh new state, because we are restarting a new port forwarding, and don't care about the state of the previous one.

dharmit · 2022-08-10T07:10:47Z

pkg/portForward/portForward.go

@@ -58,7 +66,6 @@ func (o *PFClient) StartPortForwarding(
 	o.StopPortForwarding()

 	o.stopChan = make(chan struct{}, 1)


Why is this a buffered channel?

Using a non-buffered one, writing into this channel will block until the someone reads the channel.

Usin a buffered one, the first write into the channel won't block, even if no one reads it.

This way, we can write on the channel and continue, even if the forward function is not yet started and so wion't read the channel.

That's also why I'm recreating a fresh new channel after, so if the channel has not been read, we won't be blocked during the second write.

This way, we can write on the channel and continue, even if the forward function is not yet started and so wion't read the channel.

This makes more sense when I look at the following block, where we are first calling StopForwarding under a certain situation and then calling StartPortForwarding:

odo/pkg/devfile/adapters/kubernetes/component/adapter.go

Lines 313 to 317 in 076f2e2

if podChanged || portsChanged {

a.portForwardClient.StopPortForwarding()

}

err = a.portForwardClient.StartPortForwarding(a.Devfile, a.ComponentName, parameters.RandomPorts, parameters.ErrOut)

Is it for this scenario that we are using buffered channel here?

That's also why I'm recreating a fresh new channel after, so if the channel has not been read, we won't be blocked during the second write.

Are you referring to the code:

o.stopChan <- struct{}{} o.stopChan = make(chan struct{}, 1)

But that defies the purpose of having a channel with buffer size 1, no? If we don't want to be blocked, we should create a larger buffer.

But of which size? If you create it of size n, it will happen once in a while that it will be written n+1 times.

Particularly, I tested the changes by flooding the app with the following command, while modifying source code so the pod restarts:

while : ; do curl localhost:40001; done

With a buffered channel of size 100, I'm pretty sure I would have filled the buffer

For channels sizing, see this recommendation (size of 0 or 1) from the coding conventions doc: https://github.com/redhat-developer/odo/wiki/Dev:-Coding-Conventions#channel-size-is-one-or-none

tests/integration/cmd_dev_test.go

Co-authored-by: Dharmit Shah <shahdharmit@gmail.com>

valaparthvi · 2022-08-10T11:05:18Z

/approve

openshift-ci · 2022-08-10T11:05:28Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: valaparthvi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [valaparthvi]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

dharmit · 2022-08-10T12:29:36Z

/lgtm

feloy · 2022-08-10T14:24:49Z

Openshift IBM Cloud tests:

odo dev command tests
/go/odo_1/tests/integration/cmd_dev_test.go:29
  devfile contains composite apply command
  /go/odo_1/tests/integration/cmd_dev_test.go:1197
    when odo dev is running
    /go/odo_1/tests/integration/cmd_dev_test.go:1209
      [It] should execute the composite apply commands successfully
      /go/odo_1/tests/integration/cmd_dev_test.go:1219

    [odo] Error: Cannot find module 'prom-client'
    [odo] Require stack:
    [odo] - /projects/server.js
    [odo]     at Function.Module._resolveFilename (internal/modules/cjs/loader.js:902:15)
    [odo]     at Function.Module._load (internal/modules/cjs/loader.js:746:27)
    [odo]     at Module.require (internal/modules/cjs/loader.js:974:19)
    [odo]     at require (internal/modules/cjs/helpers.js:93:18)
    [odo]     at Object.<anonymous> (/projects/server.js:1:20)
    [odo]     at Module._compile (internal/modules/cjs/loader.js:1085:14)
    [odo]     at Object.Module._extensions..js (internal/modules/cjs/loader.js:1114:10)
    [odo]     at Module.load (internal/modules/cjs/loader.js:950:32)
    [odo]     at Function.Module._load (internal/modules/cjs/loader.js:790:12)
    [odo]     at Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:76:12) {
    [odo]   code: 'MODULE_NOT_FOUND',
    [odo]   requireStack: [ '/projects/server.js' ]
    [odo] }

@valaparthvi
I'm seeing this error regularly in Openshift IBM Cloud tests (in this PR but also other PRs). I've made this commit to try and fix it: 4d23968

WDYT?

feloy · 2022-08-10T14:43:31Z

Restart tests to check flakiness of fixed test

rm3l

Added a few comments.

You already mentioned that, but the more I used the --random-ports flag, the more I found it confusing that users get assigned a new random port when the port forwarding is restarted (in the same Dev session). If we could store the previous ports in memory and try to reuse them first, I think this would improve the user experience. But this can be addressed in a separate issue.

LGTM overall otherwise.

pkg/portForward/portForward.go

feloy · 2022-08-11T06:10:37Z

You already mentioned that, but the more I used the --random-ports flag, the more I found it confusing that users get assigned a new random port when the port forwarding is restarted (in the same Dev session). If we could store the previous ports in memory and try to reuse them first, I think this would improve the user experience. But this can be addressed in a separate issue.

I was also inclined to do this, but this would break the integration tests, as we are not guaranteed that the port is not taken by another process during the restart of the port forwarding.

sonarqubecloud · 2022-08-11T06:17:44Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
1 Code Smell

No Coverage information
0.0% Duplication

rm3l · 2022-08-11T07:41:44Z

You already mentioned that, but the more I used the --random-ports flag, the more I found it confusing that users get assigned a new random port when the port forwarding is restarted (in the same Dev session). If we could store the previous ports in memory and try to reuse them first, I think this would improve the user experience. But this can be addressed in a separate issue.

I was also inclined to do this, but this would break the integration tests, as we are not guaranteed that the port is not taken by another process during the restart of the port forwarding.

I see. I think we can find a trade-off by letting users decide (via a yet-another flag for example) which behavior to adopt w.r.t random ports. Default behavior could be to try to reuse the ports, but in integration tests, it would be acceptable to have new ports assigned.

rm3l

One last comment - do you think unit-testing the methods in portForward.go could be do-able? I understand it might not be that easy to do, especially with the different goroutines we have and the global runtime.ErrorHandlers variable we use..

feloy · 2022-08-11T07:50:40Z

I see. I think we can find a trade-off by letting users decide (via a yet-another flag for example) which behavior to adopt w.r.t random ports. Default behavior could be to try to reuse the ports, but in integration tests, it would be acceptable to have new ports assigned.

I would prefer to wait and see if people are using this random-ports flag. It has been made for integration tests, I'm not sure there is a real use case for developers. And I would not like to add complexity to this part of the code which is already quite imbricated, if not necessary.

feloy · 2022-08-11T07:55:51Z

One last comment - do you think unit-testing the methods in portForward.go could be do-able? I understand it might not be that easy to do, especially with the different goroutines we have and the global runtime.ErrorHandlers variable we use..

That's an interesting challenge. I would like to have a try. Either in this PR or another one

rm3l · 2022-08-11T08:03:29Z

One last comment - do you think unit-testing the methods in portForward.go could be do-able? I understand it might not be that easy to do, especially with the different goroutines we have and the global runtime.ErrorHandlers variable we use..

That's an interesting challenge. I would like to have a try. Either in this PR or another one

As there was already no unit tests when portForward.go was introduced the first time, it is okay to me if unit tests are done in a separate PR (we should focus on the upcoming release candidate).

rm3l

Thanks for your work on this.

feloy · 2022-08-11T08:14:19Z

/override ci/prow/v4.10-integration-e2e

feloy · 2022-08-11T08:15:04Z

/override windows-integration-test/Windows-test

openshift-ci · 2022-08-11T08:15:55Z

@feloy: Overrode contexts on behalf of feloy: windows-integration-test/Windows-test

In response to this:

/override windows-integration-test/Windows-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci · 2022-08-11T08:15:57Z

@feloy: Overrode contexts on behalf of feloy: ci/prow/v4.10-integration-e2e, windows-integration-test/Windows-test

In response to this:

/override ci/prow/v4.10-integration-e2e
/override windows-integration-test/Windows-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci · 2022-08-11T08:15:59Z

@feloy: Overrode contexts on behalf of feloy: ci/prow/v4.10-integration-e2e

In response to this:

/override ci/prow/v4.10-integration-e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

dharmit · 2022-08-11T11:54:23Z

I see. I think we can find a trade-off by letting users decide (via a yet-another flag for example) which behavior to adopt w.r.t random ports. Default behavior could be to try to reuse the ports, but in integration tests, it would be acceptable to have new ports assigned.

I would prefer to wait and see if people are using this random-ports flag. It has been made for integration tests, I'm not sure there is a real use case for developers. And I would not like to add complexity to this part of the code which is already quite imbricated, if not necessary.

I would rather not expose that ~~port~~ flag to the users at all. Opinionated. 😉

rm3l · 2022-08-11T12:45:47Z

I see. I think we can find a trade-off by letting users decide (via a yet-another flag for example) which behavior to adopt w.r.t random ports. Default behavior could be to try to reuse the ports, but in integration tests, it would be acceptable to have new ports assigned.

I would prefer to wait and see if people are using this random-ports flag. It has been made for integration tests, I'm not sure there is a real use case for developers. And I would not like to add complexity to this part of the code which is already quite imbricated, if not necessary.

I would rather not expose that ~~port~~ flag to the users at all. Opinionated. wink

Indeed. If possible to hide it (mostly from the user-facing Help output), it would make more sense not to expose it 👍🏿

This reverts commit 53a7c3c.

* Restart port forwarding on failure * Save ports again when port forward is restarted * Integration test * Update pkg/portForward/portForward.go Co-authored-by: Dharmit Shah <shahdharmit@gmail.com> * Fix rebase * Fix integration test with run composite command * Copy errorhandlers * Add timeout for first-time port forwarding Co-authored-by: Dharmit Shah <shahdharmit@gmail.com>

openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. Required by Prow. kind/bug Categorizes issue or PR as related to a bug. labels Aug 8, 2022

openshift-ci bot requested review from cdrage and dharmit August 8, 2022 14:13

feloy force-pushed the bugfix-5877/restart-port-forward-on-failure branch from 1ea5d14 to a8815a9 Compare August 9, 2022 14:27

feloy changed the title ~~[wip] Restart port forwarding on failure~~ Restart port forwarding on failure Aug 9, 2022

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. Required by Prow. label Aug 9, 2022

feloy requested review from rm3l and removed request for cdrage August 9, 2022 15:29

feloy closed this Aug 9, 2022

feloy reopened this Aug 9, 2022

feloy force-pushed the bugfix-5877/restart-port-forward-on-failure branch from a8815a9 to 93b1a83 Compare August 9, 2022 15:49

dharmit reviewed Aug 10, 2022

View reviewed changes

feloy and others added 4 commits August 10, 2022 12:51

Restart port forwarding on failure

a18704e

Save ports again when port forward is restarted

46b7ba7

Integration test

4d40bd5

Update pkg/portForward/portForward.go

3ddb2c4

Co-authored-by: Dharmit Shah <shahdharmit@gmail.com>

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. Required by Prow. label Aug 10, 2022

openshift-ci bot assigned dharmit Aug 10, 2022

Fix integration test with run composite command

4d23968

feloy closed this Aug 10, 2022

feloy reopened this Aug 10, 2022

rm3l reviewed Aug 10, 2022

View reviewed changes

pkg/portForward/portForward.go Outdated Show resolved Hide resolved

pkg/portForward/portForward.go Outdated Show resolved Hide resolved

Copy errorhandlers

0fe2800

Add timeout for first-time port forwarding

639bf34

rm3l reviewed Aug 11, 2022

View reviewed changes

rm3l approved these changes Aug 11, 2022

View reviewed changes

openshift-ci bot assigned rm3l Aug 11, 2022

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. Required by Prow. label Aug 11, 2022

openshift-merge-robot merged commit 53a7c3c into redhat-developer:main Aug 11, 2022

feloy added a commit to feloy/odo that referenced this pull request Aug 18, 2022

Revert "Restart port forwarding on failure (redhat-developer#6013)"

7281750

This reverts commit 53a7c3c.

rm3l mentioned this pull request Sep 2, 2022

Port forwarding needs to be able to handle Pod/container restarts #5592

Closed

rm3l mentioned this pull request Jan 26, 2023

Fix potential issues with E2E tests #6543

Merged

3 tasks

		o.originalErrorHandlers = runtime.ErrorHandlers
		runtime.ErrorHandlers = append(runtime.ErrorHandlers, func(err error) {

		@@ -58,7 +66,6 @@ func (o *PFClient) StartPortForwarding(
		o.StopPortForwarding()

		o.stopChan = make(chan struct{}, 1)

	if podChanged \|\| portsChanged {
	a.portForwardClient.StopPortForwarding()
	}

	err = a.portForwardClient.StartPortForwarding(a.Devfile, a.ComponentName, parameters.RandomPorts, parameters.ErrOut)

Restart port forwarding on failure #6013

Restart port forwarding on failure #6013

Conversation

feloy commented Aug 8, 2022

netlify bot commented Aug 8, 2022 • edited Loading

✅ Deploy Preview for odo-docusaurus-preview canceled.

odo-robot bot commented Aug 8, 2022 • edited Loading

odo-robot bot commented Aug 8, 2022 • edited Loading

odo-robot bot commented Aug 8, 2022 • edited Loading

odo-robot bot commented Aug 8, 2022 • edited Loading

odo-robot bot commented Aug 8, 2022 • edited Loading

dharmit commented Aug 9, 2022

feloy commented Aug 9, 2022

dharmit left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dharmit Aug 10, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

valaparthvi commented Aug 10, 2022

openshift-ci bot commented Aug 10, 2022

dharmit commented Aug 10, 2022

feloy commented Aug 10, 2022

feloy commented Aug 10, 2022

rm3l left a comment

Choose a reason for hiding this comment

feloy commented Aug 11, 2022

sonarqubecloud bot commented Aug 11, 2022

rm3l commented Aug 11, 2022 • edited Loading

rm3l left a comment

Choose a reason for hiding this comment

feloy commented Aug 11, 2022

feloy commented Aug 11, 2022

rm3l commented Aug 11, 2022

rm3l left a comment • edited Loading

Choose a reason for hiding this comment

feloy commented Aug 11, 2022

feloy commented Aug 11, 2022

openshift-ci bot commented Aug 11, 2022

openshift-ci bot commented Aug 11, 2022

openshift-ci bot commented Aug 11, 2022

dharmit commented Aug 11, 2022 • edited Loading

rm3l commented Aug 11, 2022

netlify bot commented Aug 8, 2022 •

edited

Loading

odo-robot bot commented Aug 8, 2022 •

edited

Loading

odo-robot bot commented Aug 8, 2022 •

edited

Loading

odo-robot bot commented Aug 8, 2022 •

edited

Loading

odo-robot bot commented Aug 8, 2022 •

edited

Loading

odo-robot bot commented Aug 8, 2022 •

edited

Loading

dharmit Aug 10, 2022 •

edited

Loading

rm3l commented Aug 11, 2022 •

edited

Loading

rm3l left a comment •

edited

Loading

dharmit commented Aug 11, 2022 •

edited

Loading