[job failure] gce-master-1.8-downgrade-cluster-parallel #56879

Closed
spiffxp opened this issue Dec 6, 2017 · 16 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. milestone/needs-attention priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.

@spiffxp
Member

spiffxp commented Dec 6, 2017

/priority critical-urgent
/priority failing-test
/kind bug
/status approved-for-milestone
@kubernetes/sig-cluster-lifecycle-test-failures

This job has been failing since at least 2017-11-21. It's on the sig-release-master-upgrade dashboard, and prevents us from cutting v1.9.0-beta.2 (kubernetes/sig-release#39). Is there work ongoing to bring this job back to green?

https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gce-master-1.8-downgrade-cluster-parallel

kubetest --timeout triggered
@spiffxp spiffxp added this to the v1.9 milestone Dec 6, 2017
@k8s-ci-robot k8s-ci-robot added status/approved-for-milestone sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/bug Categorizes issue or PR as related to a bug. labels Dec 6, 2017
@spiffxp
Member Author

spiffxp commented Dec 11, 2017

Now tracking against v1.9.0 (kubernetes/sig-release#40)

All automated downgrade jobs are failing; this could really use some attention.

Maybe the same issue as #56244?

@krousey
Contributor

krousey commented Dec 12, 2017

I think I've fixed issues with the non-parallel one (both node and master downgrade failures), but this seems weird. I think there's an error in how it's configured.

From the normal downgrade (https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-master-new-downgrade-cluster/178?log#log):

W1211 12:18:26.502] 2017/12/11 12:18:26 util.go:155: Running: ./hack/ginkgo-e2e.sh --ginkgo.focus=\[Feature:ClusterDowngrade\] --upgrade-target=ci/k8s-stable1 --report-dir=/workspace/_artifacts --disable-log-dump=true --report-prefix=upgrade
W1211 12:18:26.506] Project: kubernetes-es-logging
W1211 12:18:26.506] Network Project: kubernetes-es-logging
W1211 12:18:26.506] Zone: us-central1-f
W1211 12:18:26.507] Trying to find master named 'bootstrap-e2e-master'
W1211 12:18:26.507] Looking for address 'bootstrap-e2e-master-ip'
I1211 12:18:26.608] Setting up for KUBERNETES_PROVIDER="gce".
W1211 12:18:27.388] Using master: bootstrap-e2e-master (external IP: 35.225.8.199)
I1211 12:18:28.652] Dec 11 12:18:28.652: INFO: Overriding default scale value of zero to 1
I1211 12:18:28.653] Dec 11 12:18:28.652: INFO: Overriding default milliseconds value of zero to 5000
I1211 12:18:28.777] I1211 12:18:28.776762    5867 e2e.go:384] Starting e2e run "64fefedf-de6d-11e7-9b62-0a580a3d0e17" on Ginkgo node 1
I1211 12:18:28.803] Running Suite: Kubernetes e2e suite
I1211 12:18:28.804] ===================================
I1211 12:18:28.804] Random Seed: 1512994707 - Will randomize all specs
I1211 12:18:28.804] Will run 1 of 699 specs

From this job's log (https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-master-new-downgrade-cluster-parallel/893?log#log):

W1212 01:41:20.197] 2017/12/12 01:41:20 util.go:155: Running: ./hack/ginkgo-e2e.sh --ginkgo.focus=\[Feature:ClusterDowngrade\] --upgrade-target=ci/k8s-stable1 --report-dir=/workspace/_artifacts --disable-log-dump=true --report-prefix=upgrade
W1212 01:41:20.199] Project: k8s-jkns-e2e-gce-gci
W1212 01:41:20.200] Network Project: k8s-jkns-e2e-gce-gci
W1212 01:41:20.200] Zone: us-central1-f
W1212 01:41:20.200] Trying to find master named 'bootstrap-e2e-master'
W1212 01:41:20.200] Looking for address 'bootstrap-e2e-master-ip'
I1212 01:41:20.301] Setting up for KUBERNETES_PROVIDER="gce".
W1212 01:41:21.064] Using master: bootstrap-e2e-master (external IP: 35.202.181.15)
I1212 01:41:24.401] Running Suite: Kubernetes e2e suite
I1212 01:41:24.401] ===================================
I1212 01:41:24.402] Random Seed: 1513042881 - Will randomize all specs
I1212 01:41:24.403] Will run 699 specs

What worries me is the last line. For some reason, this is running every e2e test we have, which just won't work.

edit: config is here https://github.com/kubernetes/test-infra/blob/master/jobs/config.json#L2906
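
The two summary lines compared above can be told apart mechanically. The following is a minimal sketch, not part of kubetest or Ginkgo: `spec_counts` is a hypothetical helper that only reads the two message forms quoted in the logs. It reports what was printed and does not decide whether every spec actually runs; a later comment in this thread suggests the single-number form may just be how parallel runs report.

```python
import re

# Matches both summary forms quoted in the logs above:
#   "Will run 1 of 699 specs"  and  "Will run 699 specs"
SPEC_LINE = re.compile(r"Will run (\d+)(?: of (\d+))? specs")


def spec_counts(log_text):
    """Return (focused, printed_total) from the first summary line, or None.

    focused is None when the log prints a single number only.
    """
    match = SPEC_LINE.search(log_text)
    if match is None:
        return None
    first, second = match.groups()
    if second is None:
        # Only one number was printed, so no focused count is reported.
        return (None, int(first))
    return (int(first), int(second))


serial_line = "I1211 12:18:28.804] Will run 1 of 699 specs"
parallel_line = "I1212 01:41:24.403] Will run 699 specs"

print(spec_counts(serial_line))    # (1, 699)
print(spec_counts(parallel_line))  # (None, 699)
```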

@k8s-github-robot

[MILESTONENOTIFIER] Milestone Issue Needs Attention

@spiffxp @kubernetes/sig-cluster-lifecycle-misc

Action required: During code freeze, issues in the milestone should be in progress.
If this issue is not being actively worked on, please remove it from the milestone.
If it is being worked on, please add the status/in-progress label so it can be tracked with other in-flight issues.

Note: This issue is marked as priority/critical-urgent, and must be updated every 1 day during code freeze.

Example update:

ACK.  In progress
ETA: DD/MM/YYYY
Risks: Complicated fix required
Issue Labels
  • sig/cluster-lifecycle: Issue will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move issue out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.

@enisoc
Member

enisoc commented Dec 12, 2017

@BenTheElder any ideas on the above? -^

@krousey
Contributor

krousey commented Dec 12, 2017

This was a wild goose chase. That message doesn't mean it's running all the specs; the reporting is just slightly different for parallel runs... I think.

@BenTheElder
Member

ACK, meetings all morning, catching up on these things now. I think this probably was flipping on parallel actually, @krzyzacy can you confirm?

@BenTheElder
Member

We've (@krousey wrote, I just deployed) rolled out a change that hopefully will be safe and flip these to not run in parallel. It should take effect on any future runs.

@krousey
Contributor

krousey commented Dec 12, 2017

Just to clarify @BenTheElder's update: the downgrade step won't run in parallel. The tests that follow will still honor the parallel flag.
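
The behavior described in this clarification can be sketched as a small planning function. This is an illustration only, not the change that was deployed: `plan_steps` and the step names are hypothetical, and no real kubetest flags are modeled.

```python
def plan_steps(parallel_requested):
    """Return (step name, runs in parallel?) pairs for a downgrade job.

    Hypothetical model of the behavior described above.
    """
    return [
        # The downgrade step is forced to run serially.
        ("downgrade", False),
        # The tests that follow still honor the job's parallel setting.
        ("tests", parallel_requested),
    ]


for name, parallel in plan_steps(parallel_requested=True):
    print(name, "parallel" if parallel else "serial")
```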

@krousey
Contributor

krousey commented Dec 13, 2017

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-master-new-downgrade-cluster-parallel/903 getting much better logs now.

@krousey
Contributor

krousey commented Dec 13, 2017

OK, from the new logs I can see that the parallel and non-parallel jobs are now getting hung at the same points. The logs also helped me quickly work out that my latest fix wasn't sufficient for the test environment.

@krzyzacy
Member

thanks @krousey !

@krousey
Contributor

krousey commented Dec 13, 2017

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-master-new-downgrade-cluster-parallel/904 successfully downgraded. Also, all tests passed. If this continues overnight, I say we close this issue.

@xiangpengzhao
Contributor

@krousey awesome!
We should also wait for https://k8s-testgrid.appspot.com/sig-release-master-upgrade#gce-master-1.8-downgrade-cluster to turn green. But I believe it will :)

@krousey
Contributor

krousey commented Dec 13, 2017 via email

@xiangpengzhao
Contributor

SGTM :)

@spiffxp
Member Author

spiffxp commented Dec 13, 2017

/close
OK, I've seen a few successful downgrades, and here's a full green run: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-master-new-downgrade-cluster-parallel/910

Thank you all
