Rewrite flakey workspace-in-sidecar example #4349

Merged
merged 1 commit into tektoncd:main on Nov 11, 2021

Conversation

@ghost commented Nov 2, 2021

Changes

Issue: #4169

The workspace-in-sidecar example taskruns are the source of regular
flakes in Pipelines' CI. The examples work by using files in a shared
emptyDir volume to synchronize the behaviour of two containers.

This commit introduces a named pipe for synchronizing behaviour between
the two containers, removing one of the file polling loops. The size of the
shared volume has been set extremely low (just in case the problem is related
to disk pressure on the kubelet) and extra log lines are also included to help
narrow down where a freeze might be occurring.
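
Roughly, the new synchronization works like the sketch below (illustrative only; the script shape, the shared mount path, and the pipe name are assumptions rather than the exact committed example):

    # Step container: create the pipe up front, then block until the sidecar writes to it.
    mkfifo /workspace/shared/signal.pipe           # pipe path assumed for illustration
    read msg < /workspace/shared/signal.pipe       # blocks here; no polling loop on this side
    echo "received from sidecar: $msg"

    # Sidecar container: wait for the pipe to exist (the one remaining polling loop), then write.
    while [ ! -p /workspace/shared/signal.pipe ]; do sleep 1; done
    echo "ready" > /workspace/shared/signal.pipe   # opening the pipe for writing unblocks the step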

Adding a hold to see if we can get the CI to fail on workspace-in-sidecar
examples for debugging.

/hold

/kind flake

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

Release Notes

NONE

@tekton-robot added the release-note-none (Denotes a PR that doesn't merit a release note), do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command), and kind/flake (Categorizes issue or PR as related to a flakey test) labels on Nov 2, 2021
@tekton-robot added the size/M (Denotes a PR that changes 30-99 lines, ignoring generated files) label on Nov 2, 2021
@bobcatfish (Collaborator)

i haven't seen a named pipe in so long 🤩

(if you want to try to get the CI to trigger this, one terribly hacky thing you could do is update the test logic to run this test a bunch of times - like 100 times)
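
One quick-and-dirty way to do that from the command line (the test path, subtest name, and flags here are assumptions; Tekton's e2e suite also needs a running cluster) might be:

    # Re-run only the workspace-in-sidecar example, 100 times in a row.
    go test -tags=e2e -count=100 -timeout=60m \
      -run 'TestExamples/v1beta1/taskruns/workspace-in-sidecar' ./test/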

@bobcatfish (Collaborator)

/test check-pr-has-kind-label

@bobcatfish (Collaborator)

@sbwsg i wonder if it's possible this is some kind of interaction between the volumeMounts specification and workspaces - probably a red herring but i was surprised to see this example wasn't using the sidecar workspaces feature (i guess it pre-dates it?)

@ghost (Author) commented Nov 2, 2021

@bobcatfish great idea about running the test lots of times, I'll give that a shot!

The alpha example uses Sidecar Workspaces - it's automatic so the sidecar gets access to the workspace without the explicit volumeMount that appears in the non-alpha copy of the taskrun.
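
For anyone comparing the two copies, the difference looks roughly like this sketch (sidecar, workspace, and image names here are assumptions, and unrelated fields are omitted):

    # Non-alpha taskrun: the sidecar mounts the workspace's volume explicitly.
    sidecars:
      - name: server
        image: alpine:3.12
        volumeMounts:
          - name: $(workspaces.signals.volume)
            mountPath: $(workspaces.signals.path)

    # Alpha taskrun with Sidecar Workspaces: declaring the workspace on the sidecar
    # mounts it automatically, with no explicit volumeMounts.
    sidecars:
      - name: server
        image: alpine:3.12
        workspaces:
          - name: signals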

@ghost (Author) commented Nov 3, 2021

/test pull-tekton-pipeline-integration-tests

@ghost (Author) commented Nov 3, 2021

/test pull-tekton-pipeline-integration-tests

@ghost (Author) commented Nov 3, 2021

/test pull-tekton-pipeline-integration-tests

The workspace-in-sidecar example taskruns are the source of regular
flakes in Pipelines' CI. The examples work by using files in a shared
emptyDir volume to synchronize the behaviour of two containers.

This commit introduces a named pipe for synchronizing behaviour
between the two containers, removing one of the file polling loops.
The size of the resource requests and shared volume has been set
extremely low (just in case the errors are related to disk pressure or
resource starvation on the kubelet) and extra log lines are also
included to help narrow down where a freeze might be occurring in
future.
@ghost (Author) commented Nov 3, 2021

I updated the test runner to repeat the workspace-in-sidecar example 20 times per run, and then re-ran the suite 3 times. So after 60 executions the workspace-in-sidecar example didn't hit a timeout or error.

I don't think that this PR necessarily solves the problem with the example but hopefully the extra log lines I've added will help surface the real underlying problem when it rears its head again.

@ghost (Author) commented Nov 3, 2021

/hold cancel

@tekton-robot removed the do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command) label on Nov 3, 2021
@bobcatfish (Collaborator)

> So after 60 executions the workspace-in-sidecar example didn't hit a timeout or error.

i guess you jinxed it @sbwsg X'D

that's one way to reproduce an error i suppose XD

@ghost (Author) commented Nov 3, 2021

/test pull-tekton-pipeline-alpha-integration-tests

Frustratingly the tests that failed during this alpha integration run weren't the workspace-in-sidecar examples :(

@ghost (Author) commented Nov 3, 2021

The isolated workspaces example failed because it took longer than 1 minute to complete.

    --- FAIL: TestExamples/v1beta1/pipelineruns/alpha/isolated-workspaces (62.23s)

The TestDuplicatePodTaskRun integration test also failed but with no clear explanation. I did notice that it looks like 50 taskrun pods were spun up as part of that test.

Edit: OK so the 50 separate pods is likely because there are two copies of TestDuplicatePodTaskRun: one v1alpha1 and one v1beta1. Each of those tests creates 25 TaskRuns.

@ghost (Author) commented Nov 3, 2021

/test pull-tekton-pipeline-alpha-integration-tests

@ghost mentioned this pull request Nov 5, 2021
@jerop (Member) left a comment

excited to see this fix as i just hit this issue in another pr! thanks @sbwsg 🎉

@tekton-robot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jerop

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files) label on Nov 10, 2021
@pritidesai (Member)

thank you @sbwsg for the detailed explanation 🙏 hoping not to see this flake again 🤣

/lgtm

@tekton-robot added the lgtm (Indicates that a PR is ready to be merged) label on Nov 10, 2021
@ghost (Author) commented Nov 10, 2021

/test pull-tekton-pipeline-alpha-integration-tests

@pritidesai (Member)

    examples_test.go:62: Failed waiting for task run done: "workspace-in-sidecar-h74lx" failed
    build_logs.go:35: Could not get logs for pod workspace-in-sidecar-h74lx-pod-swrd6: container "step-unnamed-0" in pod "workspace-in-sidecar-h74lx-pod-swrd6" is waiting to start: PodInitializing

flake while fixing flake 🤔

/test pull-tekton-pipeline-alpha-integration-tests

@tekton-robot merged commit 52590be into tektoncd:main on Nov 11, 2021
@ghost (Author) commented Nov 12, 2021

Sigh, ok clearly my changes don't fix the flake :D

Based on timestamps from the taskrun and pod, here's the order of operations:

(T=0s)  TaskRun created                   2021-11-10T21:00:58Z
(T=1s)  Pod scheduled, created            2021-11-10T21:00:59Z
(T=18s) Pod reaches Initialized condition 2021-11-10T21:01:16Z
(T=60s) Step times out, marked finished   2021-11-10T21:01:58Z
(T=90s) Pod deleted                       2021-11-10T21:02:28Z

A couple of observations here:

  1. When the timeout was hit the test runner recorded the Pod's state. Both the step and sidecar containers were marked as started: false with reason: PodInitializing. This contradicts the Pod's condition which suggests that it was fully initialized at T=18s. I can't explain this difference.
  2. In tandem with (1), no logs were captured because container "step-unnamed-0" in pod "workspace-in-sidecar-h74lx-pod-swrd6" is waiting to start: PodInitializing. Super weird that the Pod was in an Initialized state after 18 seconds but the containers were still in a holding pattern after 60.
  3. The Step really only had 42 seconds to complete, not 1 minute, since initialization took 18s.
  4. Duration from timeout to pod deletion is exactly 30s, which matches the pod's termination grace period. Trapping SIGINT in the sidecar and exiting immediately might speed this up and release resources back to the cluster sooner; a rough sketch of that idea follows this list.
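
A minimal sketch of that idea for the sidecar's script (the signal list and script shape are assumptions; Kubernetes normally delivers SIGTERM on pod deletion, so both signals are trapped here):

    # Exit as soon as termination is requested instead of waiting out the 30s grace period.
    trap 'echo "sidecar: termination requested, exiting"; exit 0' INT TERM
    # ... normal sidecar behaviour continues below ...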
