[e2e failure] [sig-instrumentation] Cluster level logging implemented by Stackdriver should ingest ... #57047
Copying latest findings here #56426 (comment):
@crassirostris It seems that both DaemonSets are deployed? If so, the DaemonSet controller is not at fault; it's more likely to be something with the addon manager, which should only deploy one fluentd DaemonSet. |
@janetkuo Agree, seems like a problem with addon manager /cc @mikedanese @roberthbailey @k8s-mirror-cluster-lifecycle-bugs |
At this point it's hard to tell which exact component is causing the issue. Probably worth solving the fundamental downgrade issue (#57013) first. |
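(For background on why suspicion falls on the addon manager rather than the DaemonSet controller: cluster addons are declared as manifests carrying an addonmanager.kubernetes.io/mode label, and the addon manager is the component that keeps Reconcile-mode addons in sync with what ships in the release. Below is a minimal sketch of what such a DaemonSet header looks like; the name, version suffix, and image are assumptions for illustration, not copied from the real manifest under cluster/addons/.)

```yaml
# Minimal sketch of an addon DaemonSet header; names and image are placeholders.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd-gcp-v2.0                        # version suffix changes between releases
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile  # addon manager owns and reconciles this object
spec:
  selector:
    matchLabels:
      k8s-app: fluentd-gcp
  template:
    metadata:
      labels:
        k8s-app: fluentd-gcp
    spec:
      containers:
      - name: fluentd-gcp
        image: gcr.io/google-containers/fluentd-gcp:2.0   # placeholder image for the sketch
```

If the addon manager owns the fluentd DaemonSet, two of them coexisting after a downgrade suggests the addon sync did not complete, rather than a DaemonSet controller bug.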
Digging in a bit ... Looks like the test installs a Log Provider and fails here as it finds 2 agents instead of just the one that it installs. Probably because, I am guessing, fluentd was switched on in this environment? This seems to have happened around 11/30; does that ring any bells for anyone? |
@dims The test doesn't install anything; there's a valid error in addon manager, which results in two addon DaemonSets running after the downgrade. Fluentd has been there forever, as a DaemonSet for 3-4 releases already, and nothing has changed in it except for the version, which should be handled adequately by addon manager, but it's not. |
Ack @crassirostris thanks! |
Another clue, a better one I hope: grepping through the UUID(s) for the fluentd containers - http://paste.openstack.org/raw/628792/ - I spotted the following
and, trying to see why it failed, saw the following
These logs are from: |
/sig network |
This log line usually shows up when kubelet restarts. It should recover after a while. |
Can someone from sig-node investigate why kubelet was restarted and if it's normal? @kubernetes/sig-node-bugs |
[MILESTONENOTIFIER] Milestone Issue Needs Attention @crassirostris @spiffxp @kubernetes/sig-cluster-lifecycle-misc @kubernetes/sig-instrumentation-misc @kubernetes/sig-network-misc @kubernetes/sig-node-misc Action required: During code freeze, issues in the milestone should be in progress. |
@dchen1107 will take a look at kubelet restarts to see if that's normal |
I think I got the root cause. Starting from addon-manager's log of the current run (https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gke-gci-master-gci-new-downgrade-cluster/563):
So clearly something changed for ClusterRoleBinding "kubelet-cluster-admin" between 1.8 and 1.9; upon the downgrade, the addon manager tried to apply the change to an immutable field and failed. Due to some implementation details in @cjcullen might know about the ClusterRoleBinding "kubelet-cluster-admin" change? |
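To make the failure mode concrete, here is a hedged reconstruction of the conflicting object (the 1.8 role and the kubelet subject are taken from the discussion further down; the 1.9 roleRef is not spelled out in this thread, so it is only referenced in a comment): both releases ship a ClusterRoleBinding with the same name but a different roleRef, and roleRef is immutable, so the 1.8 addon manager's apply against the 1.9-created object is rejected.

```yaml
# Hedged reconstruction; not copied from cluster/addons/rbac.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kubelet-cluster-admin
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:node   # what the 1.8 manifest binds to; 1.9 shipped a different
                      # roleRef under the same binding name
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: kubelet
```

Since an apply cannot change roleRef in place, the addon sync fails on this manifest, which would explain the partially reconciled addons (such as the duplicate fluentd DaemonSets) observed after the downgrade.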
cc @mikedanese who last touched these files: https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/rbac |
Seems like ClusterRoleBinding kubelet-cluster-admin does change a bit in the legacy-kubelet-user-disable case: |
It seems related to #53144.
In this case, we create a new cluster, which gets the new binding. Then we try to downgrade, which ends up attempting to switch back to the old binding. I wonder if we can cherry-pick something to 1.8 to make it keep the new binding upon downgrade? |
cc @liggitt and @tallclair who reviewed the above PR. |
From the addon manager's perspective, one way to achieve this is by changing the Reconcile mode to EnsureExists mode. |
1.8 created the kubelet-cluster-admin binding to the system:node role in Reconcile mode. If we want 1.9 to tolerate that and not remove it, we should leave the kubelet-cluster-admin binding to the system:node role with no subjects and EnsureExists mode. Separately, in 1.9 we can create a new kubelet-bootstrapper binding to system:node-bootstrapper for the kubelet subject with Reconcile mode. |
(Which is what it looks like https://github.com/kubernetes/kubernetes/pull/53144/files#diff-b701a165afa6442d4c7d06bf087d88fd did… the later change of what role it bound to was the breaking change) |
962e1e2 should have left the existing binding alone and created a new binding for the new role |
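Translating the plan above into manifest form, a rough sketch (the mode labels and the split into two bindings follow the comments above; the field layout and exact subject spelling are assumptions, and the authoritative change is in the actual fix PR):

```yaml
# Legacy binding: keep the 1.8 name and role, but stop reconciling it, so an
# existing 1.8-created object is tolerated and its immutable roleRef is never
# touched. No subjects, so a fresh cluster grants nothing through it.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kubelet-cluster-admin
  labels:
    addonmanager.kubernetes.io/mode: EnsureExists
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:node
---
# New binding under a new name, so Reconcile mode never needs to mutate an
# existing object's roleRef.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kubelet-bootstrapper
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:node-bootstrapper
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: kubelet                 # "the kubelet subject" per the comment above
```

Because each binding name only ever maps to one roleRef across releases, neither the upgrade nor the downgrade path has to modify an immutable field.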
@liggitt Just to make sure we're on the same page: We didn't see any problem in 1.8->1.9 upgrade tests. This only shows up in the downgrade test where we start with a fresh 1.9 cluster and downgrade to 1.8. It sounds like your plan will help with both directions, but just want to be clear. Can you work on a PR or do I need to find someone? |
Nice work @MrHohn. So we don't really capture the addon manager logs, do we? Looks like you had to pick them up from the running system, is that right? |
@dims Yeah, unfortunately addon manager logs on GKE CIs are not publicly visible. There may be ways to retrieve the same logs for historical runs; getting them from a running cluster just seemed easiest to me. |
Quickly looked at Kubelet, and the "netconfig is not ready" error message should be a red herring. Right after that, all nodes are ready, with the following logging:
Will look more. |
Ahh, after refreshing the issue I found that @MrHohn already identified the root cause. Nice work and thanks! |
The add-on manager would still have hit issues trying to reapply the changed rolebinding, but the 1.8 binding granted a superset of permissions, so we likely didn't notice.
Correct
opened #57172 |
Thanks @dchen1107, at least my investigation triggered a deeper look, which seems to have helped the cause. Yay! |
Automatic merge from submit-queue (batch tested with PRs 57172, 55382, 56147, 56146, 56158). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md. gce: split legacy kubelet node role binding and bootstrapper role binding. Fixes issue upgrading 1.8->1.9 or downgrading 1.9->1.8. Fixes #57047 ```release-note NONE ```
Reopening until we confirm the fix worked: https://k8s-testgrid.appspot.com/sig-release-1.9-all#gke-1.9-1.8-downgrade-parallel&width=80 |
/priority critical-urgent
/priority failing-test
/kind bug
/status approved-for-milestone
/area platform/gke
@kubernetes/sig-instrumentation-test-failures owns the test
This test has been failing since at least 2017-11-30 for the following job:
This job is on the sig-release-master-upgrade dashboard, and prevents us from cutting v1.9.0 (kubernetes/sig-release#40). Is there work ongoing to bring this test back to green?
/assign @crassirostris @janetkuo
Pulling this out of #56426 (comment) into its own issue. It might be GKE specific? But the GCE downgrade jobs are failing right now, so I don't have enough data to say for sure