Rewrite flakey workspace-in-sidecar example #4349

Merged
merged 1 commit into tektoncd:main on Nov 11, 2021

Conversation

@ghost commented Nov 2, 2021

Changes

Issue: #4169

The workspace-in-sidecar example taskruns are the source of regular
flakes in Pipelines' CI. The examples work by using files in a shared
emptyDir volume to synchronize the behaviour of two containers.

This commit introduces a named pipe for synchronizing behaviour between
the two containers, removing one of the file polling loops. The size of the
shared volume has been set extremely low (just in case the problem is related
to disk pressure on the kubelet) and extra log lines are also included to help
narrow down where a freeze might be occurring.
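
Roughly, the new synchronization works like the sketch below (illustrative only; the script shape, the shared mount path, and the pipe name are assumptions rather than the exact committed example):

    # Step container: create the pipe up front, then block until the sidecar writes to it.
    mkfifo /workspace/shared/signal.pipe           # pipe path assumed for illustration
    read msg < /workspace/shared/signal.pipe       # blocks here; no polling loop on this side
    echo "received from sidecar: $msg"

    # Sidecar container: wait for the pipe to exist (the one remaining polling loop), then write.
    while [ ! -p /workspace/shared/signal.pipe ]; do sleep 1; done
    echo "ready" > /workspace/shared/signal.pipe   # opening the pipe for writing unblocks the step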

Adding a hold to see if we can get the CI to fail on workspace-in-sidecar
examples for debugging.

/hold

/kind flake

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

Release Notes

NONE

@tekton-robot added the release-note-none (Denotes a PR that doesn't merit a release note), do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command), and kind/flake (Categorizes issue or PR as related to a flakey test) labels on Nov 2, 2021
@tekton-robot added the size/M (Denotes a PR that changes 30-99 lines, ignoring generated files) label on Nov 2, 2021
@bobcatfish (Collaborator)

i haven't seen a named pipe in so long 🤩

(if you want to try to get the CI to trigger this, one terribly hacky thing you could do is update the test logic to run this test a bunch of times - like 100 times)
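
One quick-and-dirty way to do that from the command line (the test path, subtest name, and flags here are assumptions; Tekton's e2e suite also needs a running cluster) might be:

    # Re-run only the workspace-in-sidecar example, 100 times in a row.
    go test -tags=e2e -count=100 -timeout=60m \
      -run 'TestExamples/v1beta1/taskruns/workspace-in-sidecar' ./test/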

@bobcatfish (Collaborator)

/test check-pr-has-kind-label

@bobcatfish (Collaborator)

@sbwsg i wonder if it's possible this is some kind of interaction between the volumeMounts specification and workspaces - probably a red herring but i was surprised to see this example wasn't using the sidecar workspaces feature (i guess it pre-dates it?)

@ghost (Author) commented Nov 2, 2021

@bobcatfish great idea about running the test lots of times, I'll give that a shot!

The alpha example uses Sidecar Workspaces - it's automatic so the sidecar gets access to the workspace without the explicit volumeMount that appears in the non-alpha copy of the taskrun.
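
For anyone comparing the two copies, the difference looks roughly like this sketch (sidecar, workspace, and image names here are assumptions, and unrelated fields are omitted):

    # Non-alpha taskrun: the sidecar mounts the workspace's volume explicitly.
    sidecars:
      - name: server
        image: alpine:3.12
        volumeMounts:
          - name: $(workspaces.signals.volume)
            mountPath: $(workspaces.signals.path)

    # Alpha taskrun with Sidecar Workspaces: declaring the workspace on the sidecar
    # mounts it automatically, with no explicit volumeMounts.
    sidecars:
      - name: server
        image: alpine:3.12
        workspaces:
          - name: signals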

@ghost (Author) commented Nov 3, 2021

/test pull-tekton-pipeline-integration-tests

@ghost (Author) commented Nov 3, 2021

/test pull-tekton-pipeline-integration-tests

@ghost (Author) commented Nov 3, 2021

/test pull-tekton-pipeline-integration-tests

The workspace-in-sidecar example taskruns are the source of regular
flakes in Pipelines' CI. The examples work by using files in a shared
emptyDir volume to synchronize the behaviour of two containers.

This commit introduces a named pipe for synchronizing behaviour
between the two containers, removing one of the file polling loops.
The size of the resource requests and shared volume has been set
extremely low (just in case the errors are related to disk pressure or
resource starvation on the kubelet) and extra log lines are also
included to help narrow down where a freeze might be occurring in
future.
@ghost (Author) commented Nov 3, 2021

I updated the test runner to repeat the workspace-in-sidecar example 20 times per run, and then re-ran the suite 3 times. So after 60 executions the workspace-in-sidecar example didn't hit a timeout or error.

I don't think that this PR necessarily solves the problem with the example but hopefully the extra log lines I've added will help surface the real underlying problem when it rears its head again.

@ghost (Author) commented Nov 3, 2021

/hold cancel

@tekton-robot removed the do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command) label on Nov 3, 2021
@bobcatfish (Collaborator)

> So after 60 executions the workspace-in-sidecar example didn't hit a timeout or error.

i guess you jinxed it @sbwsg X'D

that's one way to reproduce an error i suppose XD

@ghost (Author) commented Nov 3, 2021

/test pull-tekton-pipeline-alpha-integration-tests

Frustratingly the tests that failed during this alpha integration run weren't the workspace-in-sidecar examples :(

@ghost (Author) commented Nov 3, 2021

The isolated workspaces example failed because it took longer than 1 minute to complete.

    --- FAIL: TestExamples/v1beta1/pipelineruns/alpha/isolated-workspaces (62.23s)

The TestDuplicatePodTaskRun integration test also failed but with no clear explanation. I did notice that it looks like 50 taskrun pods were spun up as part of that test.

Edit: OK so the 50 separate pods is likely because there are two copies of TestDuplicatePodTaskRun: one v1alpha1 and one v1beta1. Each of those tests creates 25 TaskRuns.

@ghost (Author) commented Nov 3, 2021

/test pull-tekton-pipeline-alpha-integration-tests

@ghost mentioned this pull request Nov 5, 2021
@jerop (Member) left a comment

excited to see this fix as i just hit this issue in another pr! thanks @sbwsg 🎉

@tekton-robot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jerop

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files) label on Nov 10, 2021
@pritidesai (Member)

thank you @sbwsg for the detailed explanation 🙏 hoping not to see this flake again 🤣

/lgtm

@tekton-robot added the lgtm (Indicates that a PR is ready to be merged) label on Nov 10, 2021
@ghost (Author) commented Nov 10, 2021

/test pull-tekton-pipeline-alpha-integration-tests

@pritidesai (Member)

    examples_test.go:62: Failed waiting for task run done: "workspace-in-sidecar-h74lx" failed
    build_logs.go:35: Could not get logs for pod workspace-in-sidecar-h74lx-pod-swrd6: container "step-unnamed-0" in pod "workspace-in-sidecar-h74lx-pod-swrd6" is waiting to start: PodInitializing

flake while fixing flake 🤔

/test pull-tekton-pipeline-alpha-integration-tests

@tekton-robot merged commit 52590be into tektoncd:main on Nov 11, 2021
@ghost (Author) commented Nov 12, 2021

Sigh, ok clearly my changes don't fix the flake :D

Based on timestamps from the taskrun and pod, here's the order of operations:

(T=0s)  TaskRun created                   2021-11-10T21:00:58Z
(T=1s)  Pod scheduled, created            2021-11-10T21:00:59Z
(T=18s) Pod reaches Initialized condition 2021-11-10T21:01:16Z
(T=60s) Step times out, marked finished   2021-11-10T21:01:58Z
(T=90s) Pod deleted                       2021-11-10T21:02:28Z

A couple of observations here:

  1. When the timeout was hit the test runner recorded the Pod's state. Both the step and sidecar containers were marked as started: false with reason: PodInitializing. This contradicts the Pod's condition which suggests that it was fully initialized at T=18s. I can't explain this difference.
  2. In tandem with (1), no logs were captured because container "step-unnamed-0" in pod "workspace-in-sidecar-h74lx-pod-swrd6" is waiting to start: PodInitializing. Super weird that the Pod was in an Initialized state after 18 seconds but the containers were still in a holding pattern after 60.
  3. The Step really only had 42 seconds to complete, not 1 minute, since initialization took 18s.
  4. Duration from timeout to pod deletion is exactly 30s, which matches the pod's termination grace period. Trapping SIGINT in the sidecar and exiting immediately might speed this up and release resources back to the cluster sooner; a rough sketch of that idea follows this list.
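
A minimal sketch of that idea for the sidecar's script (the signal list and script shape are assumptions; Kubernetes normally delivers SIGTERM on pod deletion, so both signals are trapped here):

    # Exit as soon as termination is requested instead of waiting out the 30s grace period.
    trap 'echo "sidecar: termination requested, exiting"; exit 0' INT TERM
    # ... normal sidecar behaviour continues below ...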
