
Failed to scale worker nodes due to the different cgroup driver between docker and kubelet #5262

Closed
ydye opened this issue Oct 14, 2019 · 10 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

ydye commented Oct 14, 2019

Environment:

  • Cloud provider or hardware configuration: Azure VM

  • OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):

Linux 4.15.0-1019-azure x86_64
NAME="Ubuntu"
VERSION="16.04.3 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.3 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
  • Version of Ansible (ansible --version):
ansible 2.7.12
  config file = /home/core/kubespray/ansible.cfg
  configured module search path = ['/home/core/kubespray/library']
  ansible python module location = /usr/local/lib/python3.5/dist-packages/ansible
  executable location = /usr/local/bin/ansible
  python version = 3.5.2 (default, Oct  8 2019, 13:06:37) [GCC 5.4.0 20160609]

Kubespray version (commit) (git rev-parse --short HEAD):

86cc703

Network plugin used:

Calico

Copy of your inventory file:

Initial cluster

all:
  hosts:
    node1:
      ip: x.x.x.5
      access_ip: x.x.x.5
      ansible_host: x.x.x.5
    node2:
      ip: x.x.x.6
      access_ip: x.x.x.6
      ansible_host: x.x.x.6
    node3:
      ip: x.x.x.7
      access_ip: x.x.x.7
      ansible_host: x.x.x.7
  children:
    kube-master:
      hosts:
        node1:
        node3:
        node2:
    kube-node:
      hosts:
        node1:
        node2:
        node3:
    etcd:
      hosts:
        node1:
        node3:
        node2:
    k8s-cluster:
      children:
        kube-node:
        kube-master:
    calico-rr:
      hosts: {}

Scale target

all:
  hosts:
    node1:
      ip: x.x.x.5
      access_ip: x.x.x.5
      ansible_host: x.x.x.5
    node2:
      ip: x.x.x.6
      access_ip: x.x.x.6
      ansible_host: x.x.x.6
    node3:
      ip: x.x.x.7
      access_ip: x.x.x.7
      ansible_host: x.x.x.7
    node4:
      ip: x.x.x.2
      access_ip: x.x.x.2
      ansible_host: x.x.x.2
    node5:
      ip: x.x.x.3
      access_ip: x.x.x.3
      ansible_host: x.x.x.3
    node6:
      ip: x.x.x.4
      access_ip: x.x.x.4
      ansible_host: x.x.x.4
  children:
    kube-master:
      hosts:
        node1:
        node3:
        node2:
    kube-node:
      hosts:
        node1:
        node2:
        node3:
        node4:
        node5:
        node6:
    etcd:
      hosts:
        node1:
        node3:
        node2:
    k8s-cluster:
      children:
        kube-node:
        kube-master:
    calico-rr:
      hosts: {}

Other places changed

group_vars/all/k8s-cluster/k8s-cluster.yml

# Make a copy of kubeconfig on the host that runs Ansible in {{ inventory_dir }}/artifacts
kubeconfig_localhost: true
# Download kubectl onto the host that runs Ansible in {{ bin_dir }}
kubectl_localhost: true

Command used to invoke ansible:

Deploy initial cluster

ansible-playbook -i inventory/mycluster/hosts.yml cluster.yml --become --become-user=root 

Scale, add workers, fails

ansible-playbook -i inventory/mycluster/hosts.yml scale.yml --become --become-user=root 

Scale, add workers, succeeds

ansible-playbook -i inventory/mycluster/hosts.yml scale.yml --become --become-user=root 
-e "kubelet_cgroup_driver=cgroupfs"

Output of ansible run:

Deploy log:
https://gist.github.com/ydye/1f1a9cf63583a273942b5df8eae95963

Log of scale failure due to the cgroup driver mismatch:
https://gist.github.com/ydye/6ed258d72cfedf5389d0590ef15f5ac3

Successful scale:
https://gist.github.com/ydye/4752fdd33116522e74cec1b3b496f394

Anything else we need to know:

I think the 3 nodes I want to add to the cluster are fine, because when I deploy a cluster with all 6 nodes (including those 3) using cluster.yml, the deployment succeeds.

ydye added the kind/bug label Oct 14, 2019
wangycc commented Oct 16, 2019

Oct 16 15:43:53 cn-hz-wl-test-k8s-02 kubelet[78362]: F1016 15:43:53.891981   78362 server.go:273] failed to run Kubelet: failed to create kubelet: misconfiguration: kubelet cgroup driver: "systemd" is different from docker cgroup driver: "cgroupfs"

A simple question: will executing scale.yml affect the already running worker nodes?

-rw-r--r-- 1 root root 603 Oct 16 15:43 /etc/kubernetes/kubelet-config.yaml
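
For anyone debugging the same failure, a quick way to compare the two drivers on an affected node (a sketch; it assumes the driver ends up in /etc/kubernetes/kubelet-config.yaml, the file listed above, while some setups pass --cgroup-driver as a kubelet flag instead) is:

# cgroup driver reported by the docker daemon
docker info 2>/dev/null | grep -i 'cgroup driver'
# cgroup driver rendered into the kubelet config by kubespray
grep -i 'cgroupDriver' /etc/kubernetes/kubelet-config.yaml

kubelet refuses to start whenever the two values differ, which is exactly the "misconfiguration" error in the log above.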

cpiment commented Oct 30, 2019

I have also been affected by this issue. I tried to scale the cluster and the playbook apparently finished fine, but the node was not added and all the other nodes went to NotReady status because of the kubelet cgroup driver change.

Using the extra var "kubelet_cgroup_driver=cgroupfs" recommended by @ydye was key to solving the issue.

It must also be taken into account that the scale.yml playbook restarts the docker daemon across the whole cluster, making all the pods unavailable for a few seconds.

jklare commented Nov 4, 2019

I am also facing this issue and am not sure whether I should change the cgroup driver of kubelet or of docker. I guess the workaround works, but I am not sure that is what we want here.

ydye commented Nov 14, 2019

(Quoting @cpiment's comment above.)

According to this issue, maybe the new worker nodes could be added with upgrade-cluster.yml instead. I have tried it and found that the containers on the other worker nodes seem fine.
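
Presumably that run uses the same invocation as the earlier commands, just with the upgrade playbook (a sketch based on the commands above, not copied from the logs):

ansible-playbook -i inventory/mycluster/hosts.yml upgrade-cluster.yml --become --become-user=root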

lystor commented Nov 19, 2019

Same issue here with Kubespray v2.11.0 + CentOS 7

kubelet: I1119 17:44:32.195010   17070 server.go:1025] Using root directory: /var/lib/kubelet
kubelet: I1119 17:44:32.195035   17070 kubelet.go:281] Adding pod path: /etc/kubernetes/manifests
kubelet: I1119 17:44:32.195095   17070 file.go:68] Watching path "/etc/kubernetes/manifests"
kubelet: I1119 17:44:32.195117   17070 kubelet.go:306] Watching apiserver
kubelet: E1119 17:44:32.197231   17070 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dk8s83&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
kubelet: E1119 17:44:32.197231   17070 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:444: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
kubelet: E1119 17:44:32.197349   17070 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:453: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3Dk8s83&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: connect: connection refused
kubelet: I1119 17:44:32.199089   17070 client.go:75] Connecting to docker on unix:///var/run/docker.sock
kubelet: I1119 17:44:32.199120   17070 client.go:104] Start docker client with request timeout=2m0s
kubelet: W1119 17:44:32.201295   17070 docker_service.go:561] Hairpin mode set to "promiscuous-bridge" but kubenet is not enabled, falling back to "hairpin-veth"
kubelet: I1119 17:44:32.201330   17070 docker_service.go:238] Hairpin mode set to "hairpin-veth"
kubelet: W1119 17:44:32.201615   17070 cni.go:213] Unable to update cni config: No networks found in /etc/cni/net.d
kubelet: W1119 17:44:32.204678   17070 hostport_manager.go:68] The binary conntrack is not installed, this can cause failures in network connection cleanup.
kubelet: W1119 17:44:32.204747   17070 cni.go:213] Unable to update cni config: No networks found in /etc/cni/net.d
kubelet: I1119 17:44:32.204778   17070 plugins.go:161] Loaded network plugin "cni"
kubelet: I1119 17:44:32.204805   17070 docker_service.go:253] Docker cri networking managed by cni
kubelet: W1119 17:44:32.204905   17070 cni.go:213] Unable to update cni config: No networks found in /etc/cni/net.d
kubelet: I1119 17:44:32.220107   17070 docker_service.go:258] Docker Info: &{ID:VXQM:4VS7:4G2O:BPSU:S63E:RCJT:WLLO:GEJH:EOJW:MOD4:W5HA:VCQN Containers:0 ContainersRunning:0 ContainersPaused:0 ContainersStopped:0 Images:0 Driver:overlay2 DriverStatus:[[Backing Filesystem xfs] [Supports d_type true] [Native Overlay Diff true]] SystemStatus:[] Plugins:{Volume:[local] Network:[bridge host macvlan null overlay] Authorization:[] Log:[awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog]} MemoryLimit:true SwapLimit:true KernelMemory:true KernelMemoryTCP:false CPUCfsPeriod:true CPUCfsQuota:true CPUShares:true CPUSet:true PidsLimit:false IPv4Forwarding:true BridgeNfIptables:true BridgeNfIP6tables:true Debug:false NFd:24 OomKillDisable:true NGoroutines:45 SystemTime:2019-11-19T17:44:32.205954456+02:00 LoggingDriver:json-file CgroupDriver:cgroupfs NEventsListener:0 KernelVersion:3.10.0-1062.4.3.el7.x86_64 OperatingSystem:CentOS Linux 7 (Core) OSType:linux Architecture:x86_64 IndexServerAddress:https://index.docker.io/v1/ RegistryConfig:0xc00072c070 NCPU:16 MemTotal:16802422784 GenericResources:[] DockerRootDir:/var/lib/docker HTTPProxy: HTTPSProxy: NoProxy: Name:k8s83 Labels:[] ExperimentalBuild:false ServerVersion:18.09.7 ClusterStore: ClusterAdvertise: Runtimes:map[runc:{Path:runc Args:[]}] DefaultRuntime:runc Swarm:{NodeID: NodeAddr: LocalNodeState:inactive ControlAvailable:false Error: RemoteManagers:[] Nodes:0 Managers:0 Cluster:<nil> Warnings:[]} LiveRestoreEnabled:false Isolation: InitBinary:docker-init ContainerdCommit:{ID:b34a5c8af56e510852c35414db4c1f4fa6172339 Expected:b34a5c8af56e510852c35414db4c1f4fa6172339} RuncCommit:{ID:3e425f80a8c931f88e6d94a8c831b9d5aa481657 Expected:3e425f80a8c931f88e6d94a8c831b9d5aa481657} InitCommit:{ID:fec3683 Expected:fec3683} SecurityOptions:[name=seccomp,profile=default] ProductLicense:Community Engine Warnings:[]}
kubelet: F1119 17:44:32.220300   17070 server.go:273] failed to run Kubelet: failed to create kubelet: misconfiguration: kubelet cgroup driver: "systemd" is different from docker cgroup driver: "cgroupfs"
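
The Docker Info line above shows CgroupDriver:cgroupfs while kubelet was configured for systemd. The opposite alignment, switching docker to the systemd driver instead of setting kubelet_cgroup_driver=cgroupfs, is also possible; a rough sketch, assuming you manage /etc/docker/daemon.json by hand and can tolerate a docker restart, would be:

# Caution: this overwrites /etc/docker/daemon.json; merge by hand if it already has settings.
# Kubespray may manage docker's configuration itself, so a manual edit
# could be reverted on the next playbook run.
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
EOF
sudo systemctl restart docker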

vjtm mentioned this issue Dec 2, 2019
wangycc commented Dec 10, 2019

Remove kubelet_cgroup_driver: from the roles/container-engine/containerd/defaults/main.yml file:

kubelet_cgroup_driver: systemd

reference:

fejta-bot commented

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Mar 9, 2020

fejta-bot commented

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Apr 8, 2020

fejta-bot commented

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

k8s-ci-robot commented

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
