etcd pod stuck at bootstrap and kept "failed resolving host" #7798

Closed
Tracked by #14750
hongchaodeng opened this issue Apr 21, 2017 · 5 comments · Fixed by #13224

Comments

@hongchaodeng
Contributor

Failure scenario:

  • Tried to add 3 members one by one: "00", "01", "02".
  • After "02" was added, but before "02" started (there is a 5s sleep), "00" was removed.
  • Since "00" was removed, its DNS record was also removed.
  • After "02" started, it kept failing with the error: pkg/netutil: failed resolving host xxx-00:2380 (lookup xxx-00 on 10.43.240.10:53: no such host); retrying in 1s.

Expected output:
"02" was effectively stuck at bootstrap. However, since both "01" and "02" were up, they should have been able to form a quorum and serve requests.

@xiang90
Contributor

xiang90 commented Oct 4, 2017

moved to 3.4

@xiang90 modified the milestones: v3.3.0, v3.4.0 Oct 4, 2017
@worp1900

I seem to have stumbled upon this issue or at least a similar error output:

docker container logs percona_etcd_1
2018-04-24 18:54:03.741246 I | pkg/flags: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://galera_etcd:2379,http://galera_etcd:4001
2018-04-24 18:54:03.741314 I | pkg/flags: recognized and used environment variable ETCD_DATA_DIR=/opt/etcd/data
2018-04-24 18:54:03.741343 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=http://galera_etcd:2380
2018-04-24 18:54:03.741352 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER=etcd0=http://galera_etcd:2380
2018-04-24 18:54:03.741356 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=new
2018-04-24 18:54:03.741363 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster-1
2018-04-24 18:54:03.741371 I | pkg/flags: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379,http://0.0.0.0:4001
2018-04-24 18:54:03.741381 I | pkg/flags: recognized and used environment variable ETCD_LISTEN_PEER_URLS=http://0.0.0.0:2380
2018-04-24 18:54:03.741396 I | pkg/flags: recognized and used environment variable ETCD_NAME=etcd0
2018-04-24 18:54:03.741442 I | etcdmain: etcd Version: 3.3.3
2018-04-24 18:54:03.741448 I | etcdmain: Git SHA: e348b1aed
2018-04-24 18:54:03.741451 I | etcdmain: Go Version: go1.9.5
2018-04-24 18:54:03.741454 I | etcdmain: Go OS/Arch: linux/amd64
2018-04-24 18:54:03.741461 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2018-04-24 18:54:03.741473 N | etcdmain: failed to detect default host (could not find default route)
2018-04-24 18:54:03.741581 I | embed: listening for peers on http://0.0.0.0:2380
2018-04-24 18:54:03.741609 I | embed: listening for client requests on 0.0.0.0:2379
2018-04-24 18:54:03.741629 I | embed: listening for client requests on 0.0.0.0:4001
2018-04-24 18:54:13.743978 W | pkg/netutil: failed resolving host galera_etcd:2380 (lookup galera_etcd on 127.0.0.11:53: read udp 127.0.0.1:55975->127.0.0.11:53: i/o timeout); retrying in 1s
2018-04-24 18:54:24.744523 W | pkg/netutil: failed resolving host galera_etcd:2380 (lookup galera_etcd on 127.0.0.11:53: read udp 127.0.0.1:43092->127.0.0.11:53: i/o timeout); retrying in 1s
2018-04-24 18:54:33.743635 W | pkg/netutil: failed resolving host galera_etcd:2380 (i/o timeout); retrying in 1s
2018-04-24 18:54:33.743722 E | pkg/netutil: could not resolve host galera_etcd:2380
2018-04-24 18:54:33.744670 C | etcdmain: failed to resolve http://galera_etcd:2380 to match --initial-cluster=etcd0=http://galera_etcd:2380 (failed to resolve "http://galera_etcd:2380" (i/o timeout))

When starting an etcd service:

etcd:
    image: quay.io/coreos/etcd
    command: etcd
    volumes:
    - etcd_data:/etc/ssl/certs
    ports:
    - "2379:2379"
    - "2380:2380"
    env_file: etcd.env
    networks:
    - etcd

With these env vars:

ETCD_DATA_DIR=/opt/etcd/data
ETCD_NAME=etcd0
ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379,http://0.0.0.0:4001
ETCD_ADVERTISE_CLIENT_URLS=http://galera_etcd:2379,http://galera_etcd:4001
ETCD_LISTEN_PEER_URLS=http://0.0.0.0:2380
ETCD_INITIAL_ADVERTISE_PEER_URLS=http://galera_etcd:2380
ETCD_INITIAL_CLUSTER=etcd0=http://galera_etcd:2380
ETCD_INITIAL_CLUSTER_STATE=new
ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster-1

I suspect this is a DNS resolution or port problem. But opening 2379 and 2380 didn't solve it, and I wasn't able to dig deep enough into Docker DNS and hostname resolution to analyse it further.

This is running on a Photon OS host:

root@test-010 [ /etc/systemd/scripts ]# cat /etc/*-release
DISTRIB_ID="VMware Photon OS"
DISTRIB_RELEASE="2.0"
DISTRIB_CODENAME=Photon
DISTRIB_DESCRIPTION="VMware Photon OS 2.0"
NAME="VMware Photon OS"
VERSION="2.0"
ID=photon
VERSION_ID=2.0
PRETTY_NAME="VMware Photon OS/Linux"
ANSI_COLOR="1;34"
HOME_URL="https://vmware.github.io/photon/"
BUG_REPORT_URL="https://github.com/vmware/photon/issues"
VMware Photon OS 2.0
PHOTON_BUILD_NUMBER=304b817


root@test-010 [ /etc/systemd/scripts ]# docker version
Client:
Version:      17.06.0-ce
API version:  1.30
Go version:   go1.8.1
Git commit:   02c1d87
Built:        Thu Oct 26 06:33:23 2017
OS/Arch:      linux/amd64

Server:
Version:      17.06.0-ce
API version:  1.30 (minimum version 1.12)
Go version:   go1.8.1
Git commit:   02c1d87
Built:        Thu Oct 26 06:34:46 2017
OS/Arch:      linux/amd64
Experimental: false

@gyuho modified the milestones: etcd-v3.4, etcd-v3.5 Aug 5, 2019
@futangwa

futangwa commented May 7, 2020

Hi, @gyuho @xiang90 @hongchaodeng

I have the same issue, but in a slightly different scenario:
I had a cluster of 3 etcd members, for example etcd0, etcd1, and etcd2. The cluster was set up successfully. Then I took etcd2 down and did not bring it back. The cluster kept working with etcd0 and etcd1. But when I then rebooted etcd1, it failed to come up and rejoin the cluster. It keeps printing the log below and restarting:
... ...
2020-05-07 16:23:39.252814 W | pkg/netutil: failed resolving host etcd2.etcd:7001 (lookup etcd2.etcd on 10.234.0.2:53: server misbehaving); retrying in 1s
2020-05-07 16:23:40.261997 W | pkg/netutil: failed resolving host etcd2.etcd:7001 (lookup etcd2.etcd on 10.234.0.2:53: server misbehaving); retrying in 1s
2020-05-07 16:23:41.272573 W | pkg/netutil: failed resolving host etcd2.etcd:7001 (lookup etcd2.etcd on 10.234.0.2:53: server misbehaving); retrying in 1s
2020-05-07 16:23:42.278582 W | pkg/netutil: failed resolving host etcd2.etcd:7001 (lookup etcd2.etcd on 10.234.0.2:53: server misbehaving); retrying in 1s
2020-05-07 16:23:43.287286 W | pkg/netutil: failed resolving host etcd2.etcd:7001 (lookup etcd2.etcd on 10.234.0.2:53: server misbehaving); retrying in 1s
2020-05-07 16:23:44.300703 W | pkg/netutil: failed resolving host etcd2.etcd:7001 (lookup etcd2.etcd on 10.234.0.2:53: server misbehaving); retrying in 1s
2020-05-07 16:23:44.875263 E | pkg/netutil: could not resolve host etcd2.etcd:7001
2020-05-07 16:23:44.877676 C | etcdmain: error validating peerURLs {ClusterID:4e064b7f85ef47b8 Members:[&{ID:27cec8c992d21488 RaftAttributes:{PeerURLs:[http://etcd0.etcd:7001]} Attributes:{Name:etcd0 ClientURLs:[http://10.254.90.234:4001]}} &{ID:4a258c8c3afe2411 RaftAttributes:{PeerURLs:[http://etcd2.etcd:7001]} Attributes:{Name:etcd2 ClientURLs:[http://10.254.10.10:4001]}} &{ID:87fd0805633d77bd RaftAttributes:{PeerURLs:[http://etcd1.etcd:7001]} Attributes:{Name:etcd1 ClientURLs:[http://10.254.168.221:4001]}}] RemovedMemberIDs:[]}: unmatched member while checking PeerURLs (failed to resolve "http://etcd2.etcd:7001" (lookup etcd2.etcd on 10.234.0.2:53: server misbehaving))
2020-05-07 16:23:44,897 INFO exited: etcd (exit status 1; not expected)

The repeated 'failed resolving host' log was expected in my testing environment. But since we still had etcd0 and etcd1, I expected that etcd1 could at least come up again and rejoin the cluster after restarting.
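
The expectation here rests on the majority rule used by Raft-based clusters: a cluster with N voting members can make progress while more than half of them are up. The Go sketch below shows only that arithmetic; quorum and hasQuorum are illustrative names, not etcd APIs.

```go
package main

import "fmt"

// quorum returns the number of members that must be available for a
// cluster with the given membership size to make progress (a strict majority).
func quorum(members int) int {
	return members/2 + 1
}

// hasQuorum reports whether `up` running members are enough for a
// cluster whose membership list has `members` entries.
func hasQuorum(members, up int) bool {
	return members > 0 && up >= quorum(members)
}

func main() {
	// Three members are configured (etcd0, etcd1, etcd2); etcd2 is down.
	fmt.Println(quorum(3), hasQuorum(3, 2)) // prints "2 true"
	// While etcd1 is restarting, only etcd0 is up.
	fmt.Println(hasQuorum(3, 1)) // prints "false"
}
```

By this arithmetic, two of three members are enough to serve requests, so the cluster should recover as soon as etcd1 is allowed to start, even though etcd2 stays unreachable.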

I'm using an old etcd version, 3.3. Was this issue fixed in a later release, or is it still there?

Thanks a lot for your information.

@sakateka
Contributor

Hi!
I have investigated this problem.
The main cause of the failure is a missing DNS record.

Steps to reproduce the problem:

  1. Run a cluster of 3 nodes (let's say e1, e2, e3).
  2. Delete the DNS record for one of the nodes (let's say e3).
  3. Wait until the DNS cache stops resolving the deleted record.
  4. Turn off one of the nodes (let's say e2).
  5. Delete the data for this node (the WAL for e2).
  6. Try to start the node (e2) with `--initial-cluster-state existing`.

With the current code, the e2 node fails with an error at startup.
The last line in the log will be something like the following:

... PeerURLs: no match found for existing member (3d070a4fab288fc1, [http://e3.lan:32380]), last resolver error (failed to resolve \"http://e3.lan:32380\" (context deadline exceeded))" ...

The failure is caused by the netutil.URLStringsEqual function (tag v3.5.0).

After studying how this function is used, I concluded that it would be completely valid to add a short path to it: if the URLs already compare equal before address resolution, the function does not need to call the resolver and can return true immediately.

To make the problem easier to reproduce, I wrote a bash script that fully automates the steps described in this issue:

#!/bin/bash
set -xue
HOSTS=${HOSTS:-/etc/hosts}

if ! grep docker /proc/1/cgroup; then
    echo "Run me inside docker"
    exit 1
fi

if ! test -w $HOSTS; then
    echo "It is assumed that you have write rights to the $HOSTS"
    exit 1
fi

trap 'pkill -9 -x etcd' TERM EXIT ERR

TYPE="${1:-v3.5.0}"

instance() {
    rm -vrf "e$1"
    name="e$1"

    declare -A ports=(
        [e1]=2379
        [e2]=22379
        [e3]=32379
    )

    exec /$TYPE/etcd --name $name \
      --data-dir $name \
      --listen-client-urls http://127.0.0.1"$1":${ports[$name]} \
      --advertise-client-urls http://${name}.lan:${ports[$name]} \
      --listen-peer-urls http://127.0.0.1"$1":$((${ports[$name]}+1)) \
      --initial-advertise-peer-urls http://${name}.lan:$((${ports[$name]}+1)) \
      --initial-cluster e1=http://e1.lan:2380,e2=http://e2.lan:22380,e3=http://e3.lan:32380 \
      --initial-cluster-token tkn \
      --initial-cluster-state ${2:-new} &> $name.log
}

check() {
    ETCDCTL_API=3 /$TYPE/etcdctl \
      --endpoints e1.lan:2379,e2.lan:22379,e3.lan:32379 \
      endpoint health
}

ls -lh /$TYPE/
DIR="etcd-$TYPE"
mkdir -vp $DIR
pushd $DIR

echo "Fix $HOSTS"
echo "127.0.0.11 e1.lan
127.0.0.12 e2.lan
127.0.0.13 e3.lan" > $HOSTS
grep -P 'e\d.lan' $HOSTS

instance 1 &
E1PID=$!

instance 2 &
E2PID=$!

instance 3 &
E3PID=$!

until check; do
    sleep 5 # wait for cluster
done

kill -9 $E2PID $E3PID

echo "127.0.0.11 e1.lan
127.0.0.12 e2.lan" > $HOSTS
cat $HOSTS
sleep 5

instance 2 existing &
E2PID=$!
until check; do
    sleep 5
    ps axf|grep '\<[e]tcd '
    tail -n1 e*.log
done

First, prepare two directories with the etcd and etcdctl binaries: one for version v3.5.0 and one for the version from my PR #13224.
Then run this command: docker run -it --rm -v $PWD/v3.5.0:/v3.5.0 -v $PWD:/cwd -v $PWD/../bin/:/patched ubuntu /cwd/issue-7798.sh [patched], where patched is an optional argument.
I ran it from the directory ~/github.com/etcd-io/etcd/tmp after building the code with ~/github.com/etcd-io/etcd/build.sh.

@sakateka
Contributor

Hi, @gyuho @xiang90 @ptabor @spzala!
Could you please take a look at this issue and the related PR?
Thank you!

wilsonwang371 pushed a commit to wilsonwang371/etcd that referenced this issue Oct 29, 2021
If one of the nodes in the cluster has lost a dns record,
restarting the second node will break it.
This PR makes an attempt to add a comparison without using a resolver,
which allows to protect cluster from dns errors and does not break
the current logic of comparing urls in the URLStringsEqual function.
You can read more in the issue etcd-io#7798

Fixes etcd-io#7798
wilsonwang371 pushed a commit to wilsonwang371/etcd that referenced this issue Nov 3, 2021
pchan pushed a commit to pchan/etcd that referenced this issue Oct 11, 2022
pchan pushed a commit to pchan/etcd that referenced this issue Oct 11, 2022
Signed-off-by: Prasad Chandrasekaran <prasadc@vmware.com>
pchan pushed a commit to pchan/etcd that referenced this issue Oct 12, 2022
pchan pushed a commit to pchan/etcd that referenced this issue Oct 12, 2022
Signed-off-by: Prasad Chandrasekaran <prasadc@vmware.com>
pchan pushed a commit to pchan/etcd that referenced this issue Oct 12, 2022
Signed-off-by: Prasad Chandrasekaran <prasadc@vmware.com>
@serathius mentioned this issue Nov 14, 2022
22 tasks
tjungblu pushed a commit to tjungblu/etcd that referenced this issue Jul 26, 2023
Signed-off-by: Prasad Chandrasekaran <prasadc@vmware.com>