[e2e failure] [sig-instrumentation] Cluster level logging implemented by Stackdriver should ingest ... #57047

Closed
spiffxp opened this issue Dec 11, 2017 · 30 comments · Fixed by #57172
Assignees
Labels
area/provider/gcp Issues or PRs related to gcp provider kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. milestone/needs-attention priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node.
Milestone

Comments

@spiffxp
Member

spiffxp commented Dec 11, 2017

/priority critical-urgent
/priority failing-test
/kind bug
/status approved-for-milestone
/area platform/gke
@kubernetes/sig-instrumentation-test-failures
owns the test

This test has been failing since at least 2017-11-30 for the following job:

This job is on the sig-release-master-upgrade dashboard, and prevents us from cutting v1.9.0 (kubernetes/sig-release#40). Is there work ongoing to bring this test back to green?

/assign @crassirostris @janetkuo
Pulling this out of #56426 (comment) into its own issue. It might be GKE specific? But the GCE downgrade jobs are failing right now, so I don't have enough data to say for sure

@spiffxp spiffxp added this to the v1.9 milestone Dec 11, 2017
@k8s-ci-robot k8s-ci-robot added status/approved-for-milestone priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/bug Categorizes issue or PR as related to a bug. area/provider/gcp Issues or PRs related to gcp provider labels Dec 11, 2017
@janetkuo
Member

janetkuo commented Dec 11, 2017

Copying latest findings here #56426 (comment):

This is a problem with addon manager or DaemonSet controller:
1.9: fluentd DaemonSet in version 2.0.10 is using ConfigMap in version 1.2.3
1.8: fluentd DaemonSet in version 2.0.9 is using ConfigMap in version 1.2.2
I see at some point there are pods from both DaemonSets on the node:

I1205 09:30:13.798109    1389 kubelet.go:1837] SyncLoop (ADD, "api"): "fluentd-gcp-v2.0.9-r9ggr_kube-system(e2b9c0b9-d99e-11e7-a970-42010a800009)"
I1205 09:30:13.802068    1389 kubelet.go:1837] SyncLoop (ADD, "api"): "fluentd-gcp-v2.0.10-5jmf5_kube-system(e2b990cb-d99e-11e7-a970-42010a800009)"

@crassirostris It seems that both DaemonSets are deployed? If so, DaemonSet controller is not at fault. It's more likely to be something with addon manager. It should only deploy one fluentd DaemonSet.
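The overlap described above can be checked mechanically. This is a minimal sketch, assuming kubelet log lines in the `SyncLoop (ADD, "api")` format quoted above; the function name `daemonset_versions_added` and the regular expression are illustrative and not part of any Kubernetes tooling.

```python
import re

# Pattern inferred from the two kubelet log lines quoted in this issue (an
# assumption, not an official log format): it captures the DaemonSet name,
# e.g. "fluentd-gcp-v2.0.10", from a pod name such as
# "fluentd-gcp-v2.0.10-5jmf5_kube-system(<uid>)".
ADD_RE = re.compile(
    r'SyncLoop \(ADD, "api"\): "(fluentd-gcp-v[\d.]+)-[a-z0-9]+_kube-system\('
)


def daemonset_versions_added(log_lines):
    """Return the set of fluentd DaemonSet names whose pods the kubelet added."""
    found = set()
    for line in log_lines:
        match = ADD_RE.search(line)
        if match:
            found.add(match.group(1))
    return found
```

If the returned set for a single node's kubelet.log has more than one element, pods from more than one fluentd DaemonSet were scheduled onto that node.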

@crassirostris crassirostris added the sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. label Dec 11, 2017
@crassirostris

@janetkuo Agree, seems like a problem with addon manager

/cc @mikedanese @roberthbailey @k8s-mirror-cluster-lifecycle-bugs

@janetkuo janetkuo removed their assignment Dec 11, 2017
@MrHohn
Member

MrHohn commented Dec 11, 2017

At this point it is hard to tell exactly which component is causing the issue. It is probably worth solving the fundamental downgrade issue (#57013) first.

@dims
Member

dims commented Dec 12, 2017

@crassirostris

@dims The test doesn't install anything. There's a valid error in addon manager, which results in two addon DaemonSets running after the downgrade.

Fluentd has been there forever, and has been a DaemonSet for 3-4 releases already. Nothing has changed in it except the version, which addon manager should handle correctly, but it doesn't.

@dims
Member

dims commented Dec 13, 2017

Ack @crassirostris thanks!

@dims
Member

dims commented Dec 13, 2017

Another clue, a better one I hope: grepping for the UUIDs of the fluentd containers (http://paste.openstack.org/raw/628792/), I spotted the following:

./gke-bootstrap-e2e-default-pool-12112aef-xtrv/kubelet.log:I1212 18:50:02.049064    1372 kuberuntime_manager.go:371] No sandbox for pod "fluentd-gcp-v2.0.10-7b852_kube-system(3901f568-df6d-11e7-954a-42010a800003)" can be found. Need to start a new one

Looking for why it failed, I saw the following:

./gke-bootstrap-e2e-default-pool-12112aef-f6w3/kubelet.log:E1212 18:44:40.139706    1365 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR
./gke-bootstrap-e2e-default-pool-12112aef-f6w3/kubelet.log:E1212 18:44:45.139799    1365 pod_workers.go:182] Error syncing pod ff7d4a40-df6a-11e7-954a-42010a800003 ("kube-dns-autoscaler-69c5cbdcdd-b2pzr_kube-system(ff7d4a40-df6a-11e7-954a-42010a800003)"), skipping: network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
./gke-bootstrap-e2e-default-pool-12112aef-f6w3/kubelet.log:E1212 18:44:45.140838    1365 pod_workers.go:182] Error syncing pod fc13a3a1-df6a-11e7-954a-42010a800003 ("event-exporter-v0.1.7-6f59b86c5b-9fz7p_kube-system(fc13a3a1-df6a-11e7-954a-42010a800003)"), skipping: network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
./gke-bootstrap-e2e-default-pool-12112aef-f6w3/kubelet.log:E1212 18:44:45.141471    1365 pod_workers.go:182] Error syncing pod 59b5dff4-df6b-11e7-954a-42010a800003 ("foo-pmvqw_e2e-tests-sig-apps-job-upgrade-x5frz(59b5dff4-df6b-11e7-954a-42010a800003)"), skipping: network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
./gke-bootstrap-e2e-default-pool-12112aef-f6w3/kubelet.log:E1212 18:44:45.141565    1365 pod_workers.go:182] Error syncing pod 00ca992b-df6b-11e7-954a-42010a800003 ("kubernetes-dashboard-57889f9586-vb5jf_kube-system(00ca992b-df6b-11e7-954a-42010a800003)"), skipping: network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
./gke-bootstrap-e2e-default-pool-12112aef-f6w3/kubelet.log:E1212 18:44:45.141693    1365 pod_workers.go:182] Error syncing pod 08b0fade-df6b-11e7-954a-42010a800003 ("fluentd-gcp-v2.0.10-pk4wj_kube-system(08b0fade-df6b-11e7-954a-42010a800003)"), skipping: network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
./gke-bootstrap-e2e-default-pool-12112aef-f6w3/kubelet.log:E1212 18:44:45.141779    1365 pod_workers.go:182] Error syncing pod 59bba7f8-df6b-11e7-954a-42010a800003 ("apparmor-loader-8t7c5_e2e-tests-apparmor-upgrade-j2x4s(59bba7f8-df6b-11e7-954a-42010a800003)"), skipping: network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
./gke-bootstrap-e2e-default-pool-12112aef-f6w3/kubelet.log:E1212 18:44:45.141876    1365 pod_workers.go:182] Error syncing pod 59be09af-df6b-11e7-954a-42010a800003 ("ds1-p5j8p_e2e-tests-sig-apps-daemonset-upgrade-r9bqx(59be09af-df6b-11e7-954a-42010a800003)"), skipping: network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
./gke-bootstrap-e2e-default-pool-12112aef-f6w3/kubelet.log:E1212 18:44:45.141950    1365 pod_workers.go:182] Error syncing pod 5fb0c319-df6b-11e7-954a-42010a800003 ("res-cons-upgrade-ctrl-fng47_e2e-tests-hpa-upgrade-qsxs2(5fb0c319-df6b-11e7-954a-42010a800003)"), skipping: network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
./gke-bootstrap-e2e-default-pool-12112aef-f6w3/kubelet.log:E1212 18:44:45.142022    1365 pod_workers.go:182] Error syncing pod 5bf71f2d-df6b-11e7-954a-42010a800003 ("echoheaders-https-8f4n2_e2e-tests-ingress-upgrade-vrmqm(5bf71f2d-df6b-11e7-954a-42010a800003)"), skipping: network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
./gke-bootstrap-e2e-default-pool-12112aef-f6w3/kubelet.log:E1212 18:44:45.142084    1365 pod_workers.go:182] Error syncing pod 5d570514-df6b-11e7-954a-42010a800003 ("test-apparmor-l9wp6_e2e-tests-apparmor-upgrade-j2x4s(5d570514-df6b-11e7-954a-42010a800003)"), skipping: network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
./gke-bootstrap-e2e-default-pool-12112aef-f6w3/kubelet.log:E1212 18:44:45.142187    1365 pod_workers.go:182] Error syncing pod 5e7fab1b-df6b-11e7-954a-42010a800003 ("deployment-hash-test-59d9668d4c-snvxk_e2e-tests-sig-apps-deployment-upgrade-n2ftb(5e7fab1b-df6b-11e7-954a-42010a800003)"), skipping: network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
./gke-bootstrap-e2e-default-pool-12112aef-f6w3/kubelet.log:E1212 18:44:45.142335    1365 pod_workers.go:182] Error syncing pod 73f7f092-df6b-11e7-954a-42010a800003 ("service-test-sljmv_e2e-tests-service-upgrade-wn4g5(73f7f092-df6b-11e7-954a-42010a800003)"), skipping: network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
./gke-bootstrap-e2e-default-pool-12112aef-f6w3/kubelet.log:E1212 18:44:45.142449    1365 pod_workers.go:182] Error syncing pod 9e70f237-df6b-11e7-954a-42010a800003 ("res-cons-upgrade-8f8d2_e2e-tests-hpa-upgrade-qsxs2(9e70f237-df6b-11e7-954a-42010a800003)"), skipping: network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
./gke-bootstrap-e2e-default-pool-12112aef-f6w3/kubelet.log:E1212 18:44:45.142689    1365 pod_workers.go:182] Error syncing pod 2df49c13-df6c-11e7-954a-42010a800003 ("res-cons-upgrade-llrp8_e2e-tests-hpa-upgrade-qsxs2(2df49c13-df6c-11e7-954a-42010a800003)"), skipping: network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
./gke-bootstrap-e2e-default-pool-12112aef-f6w3/kubelet.log:E1212 18:44:45.439080    1365 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR
./gke-bootstrap-e2e-default-pool-12112aef-f6w3/kubelet.log:E1212 18:44:50.440793    1365 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR
./gke-bootstrap-e2e-default-pool-12112aef-n7bx/kubelet.log:E1212 18:47:16.806626    1389 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR
./gke-bootstrap-e2e-default-pool-12112aef-n7bx/kubelet.log:E1212 18:47:21.807212    1389 pod_workers.go:182] Error syncing pod e0391aa9-df6c-11e7-954a-42010a800003 ("ds1-t2tl9_e2e-tests-sig-apps-daemonset-upgrade-r9bqx(e0391aa9-df6c-11e7-954a-42010a800003)"), skipping: network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
./gke-bootstrap-e2e-default-pool-12112aef-n7bx/kubelet.log:E1212 18:47:21.807033    1389 pod_workers.go:182] Error syncing pod e038ddce-df6c-11e7-954a-42010a800003 ("fluentd-gcp-v2.0.10-72q6b_kube-system(e038ddce-df6c-11e7-954a-42010a800003)"), skipping: network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
./gke-bootstrap-e2e-default-pool-12112aef-n7bx/kubelet.log:E1212 18:47:22.009168    1389 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR
./gke-bootstrap-e2e-default-pool-12112aef-xtrv/kubelet.log:E1212 18:49:45.733332    1372 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR
./gke-bootstrap-e2e-default-pool-12112aef-xtrv/kubelet.log:E1212 18:49:50.736129    1372 pod_workers.go:182] Error syncing pod 39061ddf-df6d-11e7-954a-42010a800003 ("ds1-kmspw_e2e-tests-sig-apps-daemonset-upgrade-r9bqx(39061ddf-df6d-11e7-954a-42010a800003)"), skipping: network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
./gke-bootstrap-e2e-default-pool-12112aef-xtrv/kubelet.log:E1212 18:49:50.738713    1372 pod_workers.go:182] Error syncing pod 3901f568-df6d-11e7-954a-42010a800003 ("fluentd-gcp-v2.0.10-7b852_kube-system(3901f568-df6d-11e7-954a-42010a800003)"), skipping: network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
./gke-bootstrap-e2e-default-pool-12112aef-xtrv/kubelet.log:E1212 18:49:50.951623    1372 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR

These logs are from:
http://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gke-gci-master-gci-new-downgrade-cluster-parallel/1069/

@dims
Member

dims commented Dec 13, 2017

/sig network

@k8s-ci-robot k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Dec 13, 2017
@freehan
Contributor

freehan commented Dec 13, 2017

This log line usually shows up when kubelet restarts. It should recover after a while.

 Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR

@janetkuo
Member

Can someone from sig-node investigate why kubelet was restarted and if it's normal? @kubernetes/sig-node-bugs

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Dec 13, 2017
@k8s-github-robot

[MILESTONENOTIFIER] Milestone Issue Needs Attention

@crassirostris @spiffxp @kubernetes/sig-cluster-lifecycle-misc @kubernetes/sig-instrumentation-misc @kubernetes/sig-network-misc @kubernetes/sig-node-misc

Action required: During code freeze, issues in the milestone should be in progress.
If this issue is not being actively worked on, please remove it from the milestone.
If it is being worked on, please add the status/in-progress label so it can be tracked with other in-flight issues.

Note: This issue is marked as priority/critical-urgent, and must be updated every 1 day during code freeze.

Example update:

ACK.  In progress
ETA: DD/MM/YYYY
Risks: Complicated fix required
Issue Labels
  • sig/cluster-lifecycle sig/instrumentation sig/network sig/node: Issue will be escalated to these SIGs if needed.
  • priority/critical-urgent: Never automatically move issue out of a release milestone; continually escalate to contributor and SIG through all available channels.
  • kind/bug: Fixes a bug discovered during the current release.

@janetkuo
Member

@dchen1107 will take a look at kubelet restarts to see if that's normal

@MrHohn
Member

MrHohn commented Dec 13, 2017

I think I got the root cause.

Starting from addon-manager's log of the current run (https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gke-gci-master-gci-new-downgrade-cluster/563):

INFO: == Kubernetes addon manager started at 2017-12-13T20:33:26+0000 with ADDON_CHECK_INTERVAL_SEC=60 ==
INFO: == Default service account in the kube-system namespace has token default-token-rqmlm ==
INFO: ++ obj /etc/kubernetes/admission-controls/limit-range/limit-range.yaml is created ++
INFO: == Entering periodical apply loop at 2017-12-13T20:33:28+0000 ==
namespace "kube-system" configured
INFO: == Successfully started /opt/namespace.yaml in namespace  at 2017-12-13T20:33:29+0000
INFO: Leader is gke-xxx
INFO: Not elected leader, going back to sleep.
limitrange "limits" configured
INFO: == Successfully started /etc/kubernetes/admission-controls/limit-range/limit-range.yaml in namespace default at 2017-12-13T20:33:29+0000
INFO: Leader is gke-xxx
INFO: == Kubernetes addon ensure completed at 2017-12-13T20:34:31+0000 ==
INFO: == Reconciling with deprecated label ==
deployment "heapster-v1.4.3" created
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
The ClusterRoleBinding "kubelet-cluster-admin" is invalid: roleRef: Invalid value: rbac.RoleRef{APIGroup:"rbac.authorization.k8s.io", Kind:"ClusterRole", Name:"system:node"}: cannot change roleRef
daemonset "fluentd-gcp-v2.0.9" created
INFO: == Reconciling with addon-manager label ==
configmap "fluentd-gcp-config-v1.2.2" created
configmap "fluentd-gcp-config-v1.2.3" pruned
INFO: == Kubernetes addon reconcile completed at 2017-12-13T20:34:35+0000 ==
INFO: Leader is gke-xxx
INFO: == Kubernetes addon ensure completed at 2017-12-13T20:35:30+0000 ==
INFO: == Reconciling with deprecated label ==
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
The ClusterRoleBinding "kubelet-cluster-admin" is invalid: roleRef: Invalid value: rbac.RoleRef{APIGroup:"rbac.authorization.k8s.io", Kind:"ClusterRole", Name:"system:node"}: cannot change roleRef
INFO: == Reconciling with addon-manager label ==
INFO: == Kubernetes addon reconcile completed at 2017-12-13T20:35:34+0000 ==
...

So clearly something changed for ClusterRoleBinding "kubelet-cluster-admin" between 1.8 and 1.9. Upon the downgrade, addon manager tried to apply the change to an immutable field and failed. Due to an implementation detail in kubectl apply --prune, when any error occurs during create/update, the prune operation is not performed. Hence the fluentd-gcp-v2.0.9 DaemonSet was created but the fluentd-gcp-v2.0.10 DaemonSet was not pruned. Also, due to an implementation detail in addon manager, the fluentd DaemonSet and the fluentd ConfigMap are managed in separate label groups, which explains why fluentd-gcp-config-v1.2.3 was pruned successfully.
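To make the failure mode concrete, here is a toy model of the behavior described above. It is not kubectl's actual implementation: the function `apply_with_prune` and its parameters are hypothetical, and the object sets are taken from the addon manager log in this comment.

```python
def apply_with_prune(desired, live, failing):
    """Toy model of one apply-then-prune pass over a single label group.

    desired: names of objects in the manifests being applied
    live:    names of objects currently in the cluster for this label group
    failing: names whose create/update is rejected by the API server
    Returns (objects left in the cluster, names that failed).
    """
    cluster = set(live)
    errors = []
    for name in sorted(desired):
        if name in failing:
            errors.append(name)   # the existing object is left as it was
        else:
            cluster.add(name)     # created or updated
    if not errors:
        cluster &= set(desired)   # prune only after an error-free apply
    return cluster, errors


# Label group holding the ClusterRoleBinding and the DaemonSet (downgrade to 1.8).
group_a, errs_a = apply_with_prune(
    desired={"kubelet-cluster-admin", "fluentd-gcp-v2.0.9"},
    live={"kubelet-cluster-admin", "fluentd-gcp-v2.0.10"},
    failing={"kubelet-cluster-admin"},  # "cannot change roleRef"
)

# Separate label group holding the ConfigMap: no error, so pruning happens.
group_b, errs_b = apply_with_prune(
    desired={"fluentd-gcp-config-v1.2.2"},
    live={"fluentd-gcp-config-v1.2.3"},
    failing=set(),
)

print(sorted(group_a))
print(sorted(group_b))
```

In this model the first group ends up with both fluentd DaemonSets because the roleRef error suppresses pruning, while the second group is pruned down to the 1.8 ConfigMap.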

@cjcullen might know about the ClusterRoleBinding "kubelet-cluster-admin" change?

@enisoc
Member

enisoc commented Dec 13, 2017

@MrHohn
Member

MrHohn commented Dec 14, 2017

@enisoc
Member

enisoc commented Dec 14, 2017

It seems related to #53144.

new clusters will pickup the new binding and old clusters will keep the old binding

In this case, we create a new cluster, which gets the new binding. Then we try to downgrade, which ends up attempting to switch back to the old binding.

I wonder if we can cherry-pick something to 1.8 to make it keep the new binding upon downgrade?

@enisoc
Member

enisoc commented Dec 14, 2017

cc @liggitt and @tallclair who reviewed the above PR.

@MrHohn
Member

MrHohn commented Dec 14, 2017

I wonder if we can cherry-pick something to 1.8 to make it keep the new binding upon downgrade?

From addonmanager perspective, one way to achieve this is by changing the Reconcile mode to EnsureExist mode.

addonmanager.kubernetes.io/mode: Reconcile

@liggitt
Member

liggitt commented Dec 14, 2017

1.8 created the kubelet-cluster-admin binding to the system:node role in Reconcile mode

If we want 1.9 to tolerate that and not remove it, we should leave the kubelet-cluster-admin binding to system:node role with no subjects and EnsureExists mode

Separately in 1.9 we can create a new kubelet-bootstrapper binding to system:node-bootstrapper for the kubelet subject with reconcile mode
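The reasoning behind this plan can be sketched as a small decision function. This is a simplified model under stated assumptions, not addon manager code: `reconcile_binding` and the dict layout are hypothetical, and the role that the pre-fix 1.9 binding pointed to (`system:node-bootstrapper`) is inferred from this discussion rather than shown in the logs.

```python
def reconcile_binding(manifest, live):
    """Simplified model of what happens to one ClusterRoleBinding.

    manifest: dict with "mode", "roleRef", "subjects" (what the release ships)
    live:     dict with "roleRef", "subjects", or None if absent from the cluster
    """
    if live is None:
        return "create"
    if manifest["mode"] == "EnsureExists":
        return "leave as is"                   # only existence is checked
    if manifest["roleRef"] != live["roleRef"]:
        return "error: cannot change roleRef"  # roleRef is immutable
    if manifest["subjects"] != live["subjects"]:
        return "update subjects"
    return "unchanged"


# kubelet-cluster-admin as shipped by 1.8 (Reconcile mode, bound to system:node).
v18_manifest = {"mode": "Reconcile", "roleRef": "system:node", "subjects": ["kubelet"]}

# Assumed state of a fresh 1.9 cluster before the fix (roleRef inferred).
live_19_before_fix = {"roleRef": "system:node-bootstrapper", "subjects": ["kubelet"]}

# State of a fresh 1.9 cluster under the plan above: same roleRef, no subjects.
live_19_with_plan = {"roleRef": "system:node", "subjects": []}

# kubelet-cluster-admin as 1.9 would ship it under the plan above.
v19_manifest_plan = {"mode": "EnsureExists", "roleRef": "system:node", "subjects": []}

print(reconcile_binding(v18_manifest, live_19_before_fix))  # downgrade, before fix
print(reconcile_binding(v18_manifest, live_19_with_plan))   # downgrade, with plan
print(reconcile_binding(v19_manifest_plan, {"roleRef": "system:node", "subjects": ["kubelet"]}))  # upgrade
```

Under these assumptions the downgrade only has to change subjects, which are mutable, and the upgrade leaves the 1.8 binding untouched, so neither direction trips over the immutable roleRef.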

@liggitt
Member

liggitt commented Dec 14, 2017

(Which is what it looks like https://github.com/kubernetes/kubernetes/pull/53144/files#diff-b701a165afa6442d4c7d06bf087d88fd did… the later change of what role it bound to was the breaking change)

@liggitt
Member

liggitt commented Dec 14, 2017

962e1e2 should have left the existing binding alone and created a new binding for the new role

@enisoc
Member

enisoc commented Dec 14, 2017

@liggitt Just to make sure we're on the same page: We didn't see any problem in 1.8->1.9 upgrade tests. This only shows up in the downgrade test where we start with a fresh 1.9 cluster and downgrade to 1.8. It sounds like your plan will help with both directions, but just want to be clear.

Can you work on a PR or do I need to find someone?

@dims
Member

dims commented Dec 14, 2017

Nice work @MrHohn, So we don't really capture the addon manager logs do we? Looks like you had to pick one up from the running system, is that right?

@MrHohn
Member

MrHohn commented Dec 14, 2017

So we don't really capture the addon manager logs do we? Looks like you had to pick one up from the running system, is that right?

@dims Yeah, unfortunately addon manager logs on GKE CI runs are not publicly visible. There may be ways to retrieve the same logs for historical runs, but getting them from a running cluster seemed easiest to me.

@dchen1107
Member

I took a quick look at the kubelet, and the "netConfig is not ready" error message looks like a red herring. Right after it, all nodes are ready, with the following logging:

E1212 18:44:50.440793    1365 kubelet.go:2095] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR
I1212 18:44:50.566710    1365 kuberuntime_manager.go:899] updating runtime config through cri with podcidr 10.52.3.0/24
I1212 18:44:50.567335    1365 docker_service.go:307] docker cri received runtime config &RuntimeConfig{NetworkConfig:&NetworkConfig{PodCidr:10.52.3.0/24,},}
I1212 18:44:50.567660    1365 kubenet_linux.go:265] CNI network config set to {
  "cniVersion": "0.1.0",
  "name": "kubenet",
  "type": "bridge",
  "bridge": "cbr0",
  "mtu": 1460,
  "addIf": "eth0",
  "isGateway": true,
  "ipMasq": false,
  "hairpinMode": false,
  "ipam": {
    "type": "host-local",
    "subnet": "10.52.3.0/24",
    "gateway": "10.52.3.1",
    "routes": [
      { "dst": "0.0.0.0/0" }
    ]
  }
}
I1212 18:44:50.571546    1365 kubelet_network.go:276] Setting Pod CIDR:  -> 10.52.3.0/24

Will look more.
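The recovery visible in the excerpt above can be measured directly. This is a minimal sketch, assuming kubelet lines that start with the timestamp prefix shown above (no file path prefix) and a log that does not cross midnight; the helper names `timestamp_us` and `recovery_delay_us` are illustrative.

```python
import re

# Timestamp prefix as it appears in the excerpt above, e.g. "E1212 18:44:50.440793".
TS_RE = re.compile(r"^[IWEF]\d{4} (\d{2}):(\d{2}):(\d{2})\.(\d{6})")


def timestamp_us(line):
    """Return the time of day in microseconds, or None for continuation lines."""
    match = TS_RE.match(line)
    if not match:
        return None
    hours, minutes, seconds, micros = (int(g) for g in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1_000_000 + micros


def recovery_delay_us(log_lines):
    """Microseconds from the last NetworkPluginNotReady error seen so far to
    the first following "Setting Pod CIDR" line, or None if there is none."""
    last_error = None
    for line in log_lines:
        stamp = timestamp_us(line)
        if stamp is None:
            continue
        if "NetworkPluginNotReady" in line:
            last_error = stamp
        elif "Setting Pod CIDR" in line and last_error is not None:
            return stamp - last_error
    return None
```

Applied to the excerpt above, the delay between the last error (18:44:50.440793) and the Pod CIDR being set (18:44:50.571546) is 130753 microseconds, which is consistent with the error being transient.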

@dchen1107
Member

Ahh, after refreshing the issue, I found that @MrHohn already identified the root cause. Nice work and thanks!

@liggitt
Member

liggitt commented Dec 14, 2017

We didn't see any problem in 1.8->1.9 upgrade tests.

The add-on manager would still have hit issues trying to reapply the changed rolebinding, but the 1.8 binding granted a superset of permissions, so we likely didn't notice.

This only shows up in the downgrade test where we start with a fresh 1.9 cluster and downgrade to 1.8. It sounds like your plan will help with both directions, but just want to be clear.

Correct

Can you work on a PR or do I need to find someone?

opened #57172

@dims
Member

dims commented Dec 14, 2017

Thanks @dchen1107! At least my investigation triggered a deeper look, which seems to have helped the cause. Yay!

k8s-github-robot pushed a commit that referenced this issue Dec 14, 2017
Automatic merge from submit-queue (batch tested with PRs 57172, 55382, 56147, 56146, 56158). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

gce: split legacy kubelet node role binding and bootstrapper role binding

fixes issue upgrading 1.8->1.9 or downgrading 1.9->1.8

fixes #57047

```release-note
NONE
```
@enisoc enisoc reopened this Dec 14, 2017
@enisoc
Member

enisoc commented Dec 14, 2017

@dims
Member

dims commented Dec 14, 2017

Looks like we are green - https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gke-beta-stable1-downgrade-cluster-parallel/368?log#log

@enisoc enisoc closed this as completed Dec 14, 2017