Makefile: Bump test-e2e timeout to 90 minutes #2474

Merged: 1 commit into openshift:master from bump-e2e-op-timeout on Mar 22, 2021

Conversation

@wking (Member) commented Mar 20, 2021:

Recent jobs have seen a lot of [1]:

  panic: test timed out after 1h15m0s

Nothing looks wrong with the test-case that was running at panic time,
there are just a bunch of slow cases, and we've already consumed most
of the 1.25 hours before we start in on the panicking case:

  $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2472/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1373072029300822016/build-log.txt | sed -n 's/.*PASS.*[(]\([0-9]*\.[0-9]*\)s)$/\1/p' | awk 'BEGIN{s = 0}{s += $1}END{print s / 3600}'
  1.17946
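That works out to roughly 70.8 minutes (1.17946 h) of the 75-minute budget already spent on passing cases, leaving only about four minutes for the case that panicked and anything after it.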

Breaking it down by test:

  $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2472/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1373072029300822016/build-log.txt | grep PASS
  --- PASS: TestContainerRuntimeConfigLogLevel (550.98s)
  --- PASS: TestKubeletConfigMaxPods (584.46s)
  --- PASS: TestMCDToken (0.27s)
  --- PASS: TestMCDeployed (1684.42s)
  --- PASS: TestKernelArguments (343.36s)
  --- PASS: TestKernelType (469.23s)
  --- PASS: TestExtensions (522.46s)
  --- PASS: TestNoReboot (40.48s)
  --- PASS: TestPoolDegradedOnFailToRender (14.12s)
  --- PASS: TestReconcileAfterBadMC (36.26s)

Comparing with a run from a month ago [2]:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2406/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1362580031678713856/artifacts/e2e-gcp-op/test/build-log.txt | grep PASS
  --- PASS: TestContainerRuntimeConfigLogLevel (433.36s)
  --- PASS: TestKubeletConfigMaxPods (500.96s)
  --- PASS: TestMCDToken (0.28s)
  --- PASS: TestMCDeployed (1322.38s)
  --- PASS: TestKernelArguments (329.53s)
  --- PASS: TestKernelType (350.81s)
  --- PASS: TestExtensions (362.20s)
  --- PASS: TestNoReboot (131.19s)
  --- PASS: TestPoolDegradedOnFailToRender (14.12s)
  --- PASS: TestReconcileAfterBadMC (96.25s)
  --- PASS: TestDontDeleteRPMFiles (283.51s)
  --- PASS: TestIgn3Cfg (136.66s)
  --- PASS: TestClusterOperatorRelatedObjects (0.02s)
  --- PASS: TestMastersSchedulable (2.32s)
  --- PASS: TestClusterOperatorStatusExtension (0.07s)
  --- PASS: TestOSImageURL (0.10s)
  --- PASS: TestOperatorLabel (0.20s)
  PASS
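Summing this run the same way gives roughly 3964 seconds, about 1.10 hours, and that covers all seventeen cases rather than just the first ten.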

Some cases are slower, some are faster, but nothing looks outrageous.
And there doesn't seem to have been much running after
TestDontDeleteRPMFiles that was all that slow.  So bump the cap by 15
minutes to give us a bit more breathing room.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2472/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1373072029300822016#1:build-log.txt%3A177
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2406/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1362580031678713856
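The change itself is the one-line timeout bump named in the title, from 75m to 90m. A minimal sketch of what the resulting Makefile rule might look like (the build tags, verbosity flag, and package path here are assumptions, not the repository's actual target):

  # Sketch only: the intended change is just the go test timeout, 75m -> 90m.
  test-e2e:
  	go test -tags e2e -timeout 90m -v ./test/e2e/...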
@yuqi-zhang (Contributor) commented:

/approve
/lgtm
Approving to smooth testing a bit; we are aware of the increasing failure rate and will look into it. Historically, one major source of slowdown has been pods that do not terminate gracefully, which could be a good direction to look into next week.
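One rough way to spot them (a sketch, assuming an oc or kubectl client pointed at the cluster under test):

  # Pods stuck deleting keep a Terminating STATUS once their deletionTimestamp
  # is set; grep for them across all namespaces.
  $ oc get pods --all-namespaces | grep Terminating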

openshift-ci-robot added the lgtm label (Indicates that a PR is ready to be merged) on Mar 21, 2021.
@openshift-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files) on Mar 21, 2021.
@wking (Member, author) commented Mar 21, 2021:

Dunno what the segfault is about:

/test unit

@wking (Member, author) commented Mar 21, 2021:

/retest

@openshift-ci bot (Contributor) commented Mar 22, 2021:

@wking: The following tests failed, say /retest to rerun all failed tests:

Test name                      Commit   Details  Rerun command
ci/prow/e2e-aws-workers-rhel7  36b6d53  link     /test e2e-aws-workers-rhel7
ci/prow/okd-e2e-aws            36b6d53  link     /test okd-e2e-aws

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot merged commit d6dabad into openshift:master on Mar 22, 2021
@wking deleted the bump-e2e-op-timeout branch on March 22, 2021 at 04:18
kikisdeliveryservice added a commit to kikisdeliveryservice/machine-config-operator that referenced this pull request on Jun 10, 2021:

Test time was bumped in 4.8 from 75m to 90m via:
openshift#2474

The test has been failing often in 4.7 and will consistently run a little longer, as the exponential backoff has been removed in favor of 5-minute retries (faster in the slower cases but a little slower in the faster ones).
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/machine-config-operator that referenced this pull request on Jul 30, 2021:

Test time was bumped in 4.8 from 75m to 90m via:
openshift#2474

The test has been failing often in 4.7 and will consistently run a little longer, as the exponential backoff has been removed in favor of 5-minute retries (faster in the slower cases but a little slower in the faster ones).
kikisdeliveryservice added a commit to kikisdeliveryservice/machine-config-operator that referenced this pull request on Aug 9, 2021:

Test time was bumped in 4.8 from 75m to 90m via:
openshift#2474

The test has been failing often in 4.7 and will consistently run a little longer, as the exponential backoff has been removed in favor of 5-minute retries (faster in the slower cases but a little slower in the faster ones).

(cherry picked from commit 654bd00)