[job failure] gce-master-1.8-downgrade-cluster-parallel #56879

spiffxp · 2017-12-06T03:14:31Z

/priority critical-urgent
/priority failing-test
/kind bug
/status approved-for-milestone
@kubernetes/sig-cluster-lifecycle-test-failures

This job has been failing since at least 2017-11-21. It's on the sig-release-master-upgrade dashboard and is blocking us from cutting [v1.9.0-beta.2] (kubernetes/sig-release#39). Is anyone working on bringing this job back to green?

https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gce-master-1.8-downgrade-cluster-parallel

latest failure: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-master-new-downgrade-cluster-parallel/829

kubetest --timeout triggered


spiffxp · 2017-12-11T17:48:35Z

Now tracking against v1.9.0 (kubernetes/sig-release#40)

All automated downgrade jobs are failing; this could really use some attention.

Maybe the same issue as #56244?

krousey · 2017-12-12T18:41:46Z

I think I've fixed issues with the non-parallel one (both node and master downgrade failures), but this seems weird. I think there's an error in how it's configured.

From the normal downgrade (https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-master-new-downgrade-cluster/178?log#log):

W1211 12:18:26.502] 2017/12/11 12:18:26 util.go:155: Running: ./hack/ginkgo-e2e.sh --ginkgo.focus=\[Feature:ClusterDowngrade\] --upgrade-target=ci/k8s-stable1 --report-dir=/workspace/_artifacts --disable-log-dump=true --report-prefix=upgrade
W1211 12:18:26.506] Project: kubernetes-es-logging
W1211 12:18:26.506] Network Project: kubernetes-es-logging
W1211 12:18:26.506] Zone: us-central1-f
W1211 12:18:26.507] Trying to find master named 'bootstrap-e2e-master'
W1211 12:18:26.507] Looking for address 'bootstrap-e2e-master-ip'
I1211 12:18:26.608] Setting up for KUBERNETES_PROVIDER="gce".
W1211 12:18:27.388] Using master: bootstrap-e2e-master (external IP: 35.225.8.199)
I1211 12:18:28.652] Dec 11 12:18:28.652: INFO: Overriding default scale value of zero to 1
I1211 12:18:28.653] Dec 11 12:18:28.652: INFO: Overriding default milliseconds value of zero to 5000
I1211 12:18:28.777] I1211 12:18:28.776762    5867 e2e.go:384] Starting e2e run "64fefedf-de6d-11e7-9b62-0a580a3d0e17" on Ginkgo node 1
I1211 12:18:28.803] Running Suite: Kubernetes e2e suite
I1211 12:18:28.804] ===================================
I1211 12:18:28.804] Random Seed: 1512994707 - Will randomize all specs
I1211 12:18:28.804] Will run 1 of 699 specs

From this job's log (https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-master-new-downgrade-cluster-parallel/893?log#log):

W1212 01:41:20.197] 2017/12/12 01:41:20 util.go:155: Running: ./hack/ginkgo-e2e.sh --ginkgo.focus=\[Feature:ClusterDowngrade\] --upgrade-target=ci/k8s-stable1 --report-dir=/workspace/_artifacts --disable-log-dump=true --report-prefix=upgrade
W1212 01:41:20.199] Project: k8s-jkns-e2e-gce-gci
W1212 01:41:20.200] Network Project: k8s-jkns-e2e-gce-gci
W1212 01:41:20.200] Zone: us-central1-f
W1212 01:41:20.200] Trying to find master named 'bootstrap-e2e-master'
W1212 01:41:20.200] Looking for address 'bootstrap-e2e-master-ip'
I1212 01:41:20.301] Setting up for KUBERNETES_PROVIDER="gce".
W1212 01:41:21.064] Using master: bootstrap-e2e-master (external IP: 35.202.181.15)
I1212 01:41:24.401] Running Suite: Kubernetes e2e suite
I1212 01:41:24.401] ===================================
I1212 01:41:24.402] Random Seed: 1513042881 - Will randomize all specs
I1212 01:41:24.403] Will run 699 specs

What worries me is the last line. For some reason, this is running every e2e test we have, which just won't work.

edit: config is here https://github.com/kubernetes/test-infra/blob/master/jobs/config.json#L2906

k8s-github-robot · 2017-12-12T18:42:31Z

[MILESTONENOTIFIER] Milestone Issue Needs Attention

@spiffxp @kubernetes/sig-cluster-lifecycle-misc

Action required: During code freeze, issues in the milestone should be in progress.
If this issue is not being actively worked on, please remove it from the milestone.
If it is being worked on, please add the status/in-progress label so it can be tracked with other in-flight issues.

Note: This issue is marked as priority/critical-urgent, and must be updated every 1 day during code freeze.

Example update:

ACK.  In progress
ETA: DD/MM/YYYY
Risks: Complicated fix required

Issue Labels

sig/cluster-lifecycle: Issue will be escalated to these SIGs if needed.
priority/critical-urgent: Never automatically move issue out of a release milestone; continually escalate to contributor and SIG through all available channels.
kind/bug: Fixes a bug discovered during the current release.


enisoc · 2017-12-12T20:23:00Z

@BenTheElder any ideas on the above? -^

krousey · 2017-12-12T21:49:32Z

This was a wild goose chase. That message doesn't mean it's running all the specs; the reporting is just slightly different for parallel runs... I think.

BenTheElder · 2017-12-12T22:39:01Z

ACK. I had meetings all morning and am catching up on these now. I think this was probably caused by flipping on parallel, actually. @krzyzacy, can you confirm?

BenTheElder · 2017-12-12T23:23:48Z

We've rolled out a change (@krousey wrote it, I just deployed it) that should safely switch these jobs to not run in parallel. It should take effect on all future runs.

krousey · 2017-12-12T23:36:43Z

Just to clarify @BenTheElder's update: the downgrade step won't run in parallel. The tests that follow will still honor the parallel flag.

krousey · 2017-12-13T00:03:59Z

We're getting much better logs now: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-master-new-downgrade-cluster-parallel/903

krousey · 2017-12-13T00:39:16Z

OK, from the new logs I can see that the parallel and non-parallel jobs now hang at the same points. The logs also helped me quickly determine that my latest fix wasn't sufficient for the test environment.

krzyzacy · 2017-12-13T01:44:53Z

thanks @krousey !

krousey · 2017-12-13T04:40:42Z

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-master-new-downgrade-cluster-parallel/904 successfully downgraded, and all tests passed. If this continues overnight, I say we close this issue.

xiangpengzhao · 2017-12-13T05:53:45Z

@krousey awesome!
We should also wait for https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gce-master-1.8-downgrade-cluster to turn green. But I believe it will :)

krousey · 2017-12-13T14:55:44Z

That has a separate tracking issue. No need to wait for it.


xiangpengzhao · 2017-12-13T15:21:42Z

SGTM :)

spiffxp · 2017-12-13T15:46:50Z

/close
OK, I've seen a few successful downgrades, and here's a full green run: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-master-new-downgrade-cluster-parallel/910

Thank you all

spiffxp added this to the v1.9 milestone Dec 6, 2017

k8s-github-robot added the milestone/needs-attention label Dec 6, 2017

jberkus mentioned this issue Dec 10, 2017

[1.9] Issue Burndown kubernetes/sig-release#38

Closed

krousey mentioned this issue Dec 12, 2017

Make it so the upgrade step of a test never runs in parallel kubernetes/test-infra#5917

Merged

BenTheElder mentioned this issue Dec 12, 2017

Bumpity bump upgrade test not parallel kubernetes/test-infra#5918

Merged

krousey mentioned this issue Dec 13, 2017

Need to use the test version of env vars to pin etcd kubernetes/test-infra#5920

Merged

krzyzacy mentioned this issue Dec 13, 2017

gce-1.9-1.8-downgrade fails due to etcd crashloopbacking #57013

Closed

xiangpengzhao mentioned this issue Dec 13, 2017

Don't downgrade etcd version when downgrade cluster to 1.8. #57108

Closed

k8s-ci-robot closed this as completed Dec 13, 2017
