Some kube-system services stuck in Pending state #6

Closed
darvelo opened this issue Sep 17, 2023 · 4 comments

darvelo commented Sep 17, 2023

Any idea why kube-dns, hubble, and others would be stuck in Pending state?

kubectl -n kube-system get pods gives:

NAME                                                       READY   STATUS    RESTARTS   AGE
anetd-cf8jz                                                1/1     Running   0          13m
anetd-q5vzr                                                1/1     Running   0          13m
anetd-rzk7g                                                1/1     Running   0          13m
antrea-controller-horizontal-autoscaler-7b69d9bfd7-f82m6   0/1     Pending   0          13m
event-exporter-gke-7bf6c99dcb-grmz7                        0/2     Pending   0          13m
filestore-node-4vd54                                       3/3     Running   0          13m
filestore-node-86dbn                                       3/3     Running   0          13m
filestore-node-dssdr                                       3/3     Running   0          13m
fluentbit-gke-f9hh9                                        2/2     Running   0          13m
fluentbit-gke-m2hqb                                        2/2     Running   0          13m
fluentbit-gke-wscl5                                        2/2     Running   0          13m
gke-metadata-server-2q8q5                                  1/1     Running   0          13m
gke-metadata-server-5xgg5                                  1/1     Running   0          13m
gke-metadata-server-hmz6s                                  1/1     Running   0          13m
hubble-generate-certs-init-64mnp                           0/1     Pending   0          13m
hubble-relay-677f85b964-v2cxd                              0/2     Pending   0          14m
konnectivity-agent-autoscaler-5d9dbcc6d8-swvst             0/1     Pending   0          14m
konnectivity-agent-fb695849d-6ks95                         0/1     Pending   0          13m
konnectivity-agent-fb695849d-hdq7q                         0/1     Pending   0          14m
konnectivity-agent-fb695849d-qvck9                         0/1     Pending   0          13m
kube-dns-7f58849488-rngxv                                  0/3     Pending   0          13m
kube-dns-7f58849488-rtb7g                                  0/3     Pending   0          14m
kube-dns-autoscaler-84b8db4dc7-4qpmx                       0/1     Pending   0          13m
l7-default-backend-d86c96845-6mhrm                         0/1     Pending   0          14m
metrics-server-v0.5.2-8569bc4cf9-rt26w                     0/2     Pending   0          14m
netd-74jz8                                                 1/1     Running   0          13m
netd-ckswg                                                 1/1     Running   0          13m
netd-k6pzk                                                 1/1     Running   0          13m
pdcsi-node-csvx5                                           2/2     Running   0          13m
pdcsi-node-n46x7                                           2/2     Running   0          13m
pdcsi-node-xvqkx                                           2/2     Running   0          13m

Running kubectl -n kube-system describe pod on hubble-generate-certs-init-64mnp, hubble-relay-677f85b964-v2cxd, and the kube-dns pods returns:

Events:
  Type     Reason             Age                 From                Message
  ----     ------             ----                ----                -------
  Warning  FailedScheduling   16m (x2 over 16m)   default-scheduler   no nodes available to schedule pods
  Normal   NotTriggerScaleUp  16m                 cluster-autoscaler  pod didn't trigger scale-up:
  Warning  FailedScheduling   16m                 default-scheduler   0/1 nodes are available: 1 node(s) had untolerated taint {node.cilium.io/agent-not-ready: true}. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
  Normal   NotTriggerScaleUp  95s (x84 over 15m)  cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) had untolerated taint {node.cilium.io/agent-not-ready: true}
  Warning  FailedScheduling   9s (x3 over 11m)    default-scheduler   0/3 nodes are available: 3 node(s) had untolerated taint {node.cilium.io/agent-not-ready: true}. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
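
A quick way to double-check which taint is blocking scheduling is to read it straight off the nodes (a minimal check with plain kubectl; <node-name> is a placeholder):

# List the taint keys on every node -- the scheduler events above point at
# node.cilium.io/agent-not-ready, so it should show up here.
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'

# Full key/value/effect detail for a single node:
kubectl describe node <node-name> | grep -A5 Taints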

kubectl exec -it -n kube-system deployment/hubble-relay -c hubble-cli -- hubble gives:

Error from server (BadRequest): pod hubble-relay-677f85b964-v2cxd does not have a host assigned

My config vars are:

dataplane_v2_enabled = true
enable_dpv2_hubble   = true
machine_type       = "e2-standard-2"
preemptible        = false
disk_size_gb       = 40
initial_node_count = 3
min_nodes          = 3
max_nodes          = 6

Strange, because kubectl get nodes shows:

NAME                                               STATUS   ROLES    AGE   VERSION
gke-cluster-nodepool-d5a1f7ad-cf52   Ready    <none>   26m   v1.27.3-gke.100
gke-cluster-nodepool-d5a1f7ad-gm5c   Ready    <none>   26m   v1.27.3-gke.100
gke-cluster-nodepool-d5a1f7ad-pwhp   Ready    <none>   26m   v1.27.3-gke.100

So it seems like the nodes are up and running in my zonal cluster.


darvelo commented Sep 17, 2023

Commenting out the Cilium taint in terraform.tfvars seems to have fixed the issue. All the pods and hubble-ui appear to be running well now.

Maybe it's not needed since Dataplane V2 comes with Cilium, or maybe GCP changed the taint key. I couldn't find much documentation on this beyond https://docs.cilium.io/en/stable/installation/taints/, which recommends NoExecute over NoSchedule, but I had already removed the taint before I could test whether NoExecute would work.
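
On nodes that were created while the taint was still in the config, the taint can also be cleared by hand rather than waiting on another Terraform run; this is just standard kubectl taint-removal syntax (the trailing dash means "remove the taint with this key"):

# Clear the leftover startup taint from every node in the cluster:
kubectl taint nodes --all node.cilium.io/agent-not-ready-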

@Neutrollized (Owner) commented:

Oh yes, my bad: in my recent additions of the GKE DPV2 observability tools, I added those settings to the sample terraform.tfvars file and forgot I had a taint in there. dataplane_v2_enabled = true and the taint example are mutually exclusive: the taint is needed only if you're going to install open-source Cilium yourself. Enabling DPV2 has the GKE cluster come with a stripped-down, downstream version of Cilium pre-installed. I'll fix this in the next update. Thanks!
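
If you want to confirm a cluster is already running the managed Dataplane V2 Cilium (and therefore doesn't need the open-source install or its taint), something like this should do it -- CLUSTER_NAME/ZONE are placeholders, and the label selector assumes GKE's usual k8s-app=cilium labelling on the anetd pods:

# ADVANCED_DATAPATH here means Dataplane V2 is enabled:
gcloud container clusters describe CLUSTER_NAME --zone ZONE \
  --format='value(networkConfig.datapathProvider)'

# The managed Cilium agent runs as the anetd DaemonSet:
kubectl -n kube-system get pods -l k8s-app=cilium -o wide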

@Neutrollized (Owner) commented:

I pushed the updates in v0.14.1. Thanks for letting me know! (also updated the proxy subnet purpose setting to reflect the new name)
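
For reference, the "new name" is presumably the REGIONAL_MANAGED_PROXY subnet purpose, which replaced the older INTERNAL_HTTPS_LOAD_BALANCER value for proxy-only subnets. A rough equivalent when creating one by hand (the subnet name, region, network, and range below are placeholders):

# Proxy-only subnet for regional load balancing, using the current purpose value:
gcloud compute networks subnets create proxy-only-subnet \
  --purpose=REGIONAL_MANAGED_PROXY \
  --role=ACTIVE \
  --region=us-central1 \
  --network=my-network \
  --range=10.129.0.0/23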


darvelo commented Sep 18, 2023

Thanks @Neutrollized! 👍🏽
