Makefile: Bump test-e2e timeout to 90 minutes #2474

Merged: 1 commit into openshift:master from bump-e2e-op-timeout on Mar 22, 2021

Conversation

@wking (Member) commented Mar 20, 2021:

Recent jobs have seen a lot of [1]:

  panic: test timed out after 1h15m0s

Nothing looks wrong with the test-case that was running at panic time,
there are just a bunch of slow cases, and we've already consumed most
of the 1.25 hours before we start in on the panicking case:

  $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2472/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1373072029300822016/build-log.txt | sed -n 's/.*PASS.*[(]\([0-9]*\.[0-9]*\)s)$/\1/p' | awk 'BEGIN{s = 0}{s += $1}END{print s / 3600}'
  1.17946
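That works out to roughly 70.8 minutes (1.17946 h) of the 75-minute budget already spent on passing cases, leaving only about four minutes for the case that panicked and anything after it.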

Breaking it down by test:

  $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2472/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1373072029300822016/build-log.txt | grep PASS
  --- PASS: TestContainerRuntimeConfigLogLevel (550.98s)
  --- PASS: TestKubeletConfigMaxPods (584.46s)
  --- PASS: TestMCDToken (0.27s)
  --- PASS: TestMCDeployed (1684.42s)
  --- PASS: TestKernelArguments (343.36s)
  --- PASS: TestKernelType (469.23s)
  --- PASS: TestExtensions (522.46s)
  --- PASS: TestNoReboot (40.48s)
  --- PASS: TestPoolDegradedOnFailToRender (14.12s)
  --- PASS: TestReconcileAfterBadMC (36.26s)

Comparing with a run from a month ago [2]:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2406/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1362580031678713856/artifacts/e2e-gcp-op/test/build-log.txt | grep PASS
  --- PASS: TestContainerRuntimeConfigLogLevel (433.36s)
  --- PASS: TestKubeletConfigMaxPods (500.96s)
  --- PASS: TestMCDToken (0.28s)
  --- PASS: TestMCDeployed (1322.38s)
  --- PASS: TestKernelArguments (329.53s)
  --- PASS: TestKernelType (350.81s)
  --- PASS: TestExtensions (362.20s)
  --- PASS: TestNoReboot (131.19s)
  --- PASS: TestPoolDegradedOnFailToRender (14.12s)
  --- PASS: TestReconcileAfterBadMC (96.25s)
  --- PASS: TestDontDeleteRPMFiles (283.51s)
  --- PASS: TestIgn3Cfg (136.66s)
  --- PASS: TestClusterOperatorRelatedObjects (0.02s)
  --- PASS: TestMastersSchedulable (2.32s)
  --- PASS: TestClusterOperatorStatusExtension (0.07s)
  --- PASS: TestOSImageURL (0.10s)
  --- PASS: TestOperatorLabel (0.20s)
  PASS
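Summing this run the same way gives roughly 3964 seconds, about 1.10 hours, and that covers all seventeen cases rather than just the first ten.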

Some cases are slower, some are faster, but nothing looks outrageous.
And there doesn't seem to have been much running after
TestDontDeleteRPMFiles that was all that slow.  So bump the cap by 15
minutes to give us a bit more breathing room.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2472/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1373072029300822016#1:build-log.txt%3A177
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2406/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1362580031678713856
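The change itself is the one-line timeout bump named in the title, from 75m to 90m. A minimal sketch of what the resulting Makefile rule might look like (the build tags, verbosity flag, and package path here are assumptions, not the repository's actual target):

  # Sketch only: the intended change is just the go test timeout, 75m -> 90m.
  test-e2e:
  	go test -tags e2e -timeout 90m -v ./test/e2e/...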
@yuqi-zhang (Contributor) commented:

/approve
/lgtm
Approving to smooth testing a bit; we are aware of the increasing failure rate and will look into it. Historically, one major source of slowdown has been pods that do not terminate gracefully, which could be a good direction to look into next week.
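One rough way to spot them (a sketch, assuming an oc or kubectl client pointed at the cluster under test):

  # Pods stuck deleting keep a Terminating STATUS once their deletionTimestamp
  # is set; grep for them across all namespaces.
  $ oc get pods --all-namespaces | grep Terminating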

openshift-ci-robot added the lgtm label (Indicates that a PR is ready to be merged) on Mar 21, 2021.
@openshift-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files) on Mar 21, 2021.
@wking (Member, author) commented Mar 21, 2021:

Dunno what the segfault is about:

/test unit

@wking (Member, author) commented Mar 21, 2021:

/retest

@openshift-ci bot (Contributor) commented Mar 22, 2021:

@wking: The following tests failed, say /retest to rerun all failed tests:

Test name                      Commit   Details  Rerun command
ci/prow/e2e-aws-workers-rhel7  36b6d53  link     /test e2e-aws-workers-rhel7
ci/prow/okd-e2e-aws            36b6d53  link     /test okd-e2e-aws

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot merged commit d6dabad into openshift:master on Mar 22, 2021
@wking deleted the bump-e2e-op-timeout branch on March 22, 2021 at 04:18
kikisdeliveryservice added a commit to kikisdeliveryservice/machine-config-operator that referenced this pull request on Jun 10, 2021:

Test time was bumped in 4.8 from 75m to 90m via:
openshift#2474

The test has been failing often in 4.7 and will consistently run a little longer, as the exponential backoff has been removed in favor of 5-minute retries (faster in the slower cases but a little slower in the faster ones).
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/machine-config-operator that referenced this pull request on Jul 30, 2021:

Test time was bumped in 4.8 from 75m to 90m via:
openshift#2474

The test has been failing often in 4.7 and will consistently run a little longer, as the exponential backoff has been removed in favor of 5-minute retries (faster in the slower cases but a little slower in the faster ones).
kikisdeliveryservice added a commit to kikisdeliveryservice/machine-config-operator that referenced this pull request on Aug 9, 2021:

Test time was bumped in 4.8 from 75m to 90m via:
openshift#2474

The test has been failing often in 4.7 and will consistently run a little longer, as the exponential backoff has been removed in favor of 5-minute retries (faster in the slower cases but a little slower in the faster ones).

(cherry picked from commit 654bd00)