This document describes the reconfiguration and recovery of an etcd cluster running on Container Linux, using a combination of systemd
features and etcdctl
commands. The examples given in this document show the configuration for a three-node Container Linux cluster. Replace the IP addresses used in the examples with the corresponding real IPs.
When a Container Linux Config is used to configure an etcd member on a Container Linux node, it is compiled into a drop-in unit file at /etc/systemd/system/etcd-member.service.d/20-clct-etcd-member.conf. For example:
etcd:
  name: demo-etcd-1
  listen_client_urls: http://10.240.0.1:2379,http://0.0.0.0:4001
  advertise_client_urls: http://10.240.0.1:2379
  listen_peer_urls: http://0.0.0.0:2380
  initial_advertise_peer_urls: http://10.240.0.1:2380
  initial_cluster: demo-etcd-1=http://10.240.0.1:2380,demo-etcd-2=http://10.240.0.2:2380,demo-etcd-3=http://10.240.0.3:2380
  initial_cluster_token: demo-etcd-token
  initial_cluster_state: new
The above Container Linux config file can be used to provision a machine. Provisioning with a config file creates the following drop-in:
[Service]
ExecStart=
ExecStart=/usr/lib/coreos/etcd-wrapper $ETCD_OPTS \
--name="demo-etcd-1" \
--listen-peer-urls="http://0.0.0.0:2380" \
--listen-client-urls="http://10.240.0.1:2379,http://0.0.0.0:4001" \
--initial-advertise-peer-urls="http://10.240.0.1:2380" \
--initial-cluster="demo-etcd-1=http://10.240.0.1:2380,demo-etcd-2=http://10.240.0.2:2380,demo-etcd-3=http://10.240.0.3:2380" \
--initial-cluster-state="new" \
--initial-cluster-token="demo-etcd-token" \
--advertise-client-urls="http://10.240.0.1:2379"
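To confirm what the provisioned node will actually run, the merged unit (the base etcd-member.service plus any drop-ins) can be inspected with systemd's own tooling. This is only a quick sanity check, not part of the provisioning flow itself:
$ systemctl cat etcd-member.service
$ systemctl status etcd-member.service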
If the etcd cluster is secured with TLS, use https:// instead of http:// in the config files. If the peer addresses for the initial cluster are unknown when provisioning the cluster, use the etcd discovery service with the --discovery="https://discovery.etcd.io/<token>" argument.
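For the discovery case, a minimal Container Linux Config sketch might look like the one below. The token URL is a placeholder (a new one can be generated at https://discovery.etcd.io/new?size=3), and the sketch assumes the Config Transpiler passes the discovery option through to etcd's --discovery flag in the same way it handles the other options; discovery replaces the initial_cluster settings:
etcd:
  name: demo-etcd-1
  listen_client_urls: http://10.240.0.1:2379,http://0.0.0.0:4001
  advertise_client_urls: http://10.240.0.1:2379
  listen_peer_urls: http://0.0.0.0:2380
  initial_advertise_peer_urls: http://10.240.0.1:2380
  discovery: https://discovery.etcd.io/<token>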
Changing the size of an etcd cluster is as simple as adding a new member and then using the output of the member addition (the name of the new etcd member, the member IDs, the cluster state, and the cluster URLs) in the Container Linux Config used to provision the new node. Removing a member is sketched after the steps below.
-
Run the etcdctl member add command. For example:
$ etcdctl member add node4 http://10.240.0.4:2380
The output of a successful member addition is given below:
added member 9bf1b35fc7761a23 to cluster

ETCD_NAME="node4"
ETCD_INITIAL_CLUSTER="demo-etcd-1=http://10.240.0.1:2380,demo-etcd-2=http://10.240.0.2:2380,demo-etcd-3=http://10.240.0.3:2380,node4=http://10.240.0.4:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
-
Store the output of this command for later use.
-
Use the information from the output of the etcdctl member add command and provision a new Container Linux host with the following Container Linux Config:
etcd:
  name: node4
  listen_client_urls: http://10.240.0.4:2379,http://0.0.0.0:4001
  advertise_client_urls: http://10.240.0.4:2379
  listen_peer_urls: http://0.0.0.0:2380
  initial_advertise_peer_urls: http://10.240.0.4:2380
  initial_cluster: demo-etcd-1=http://10.240.0.1:2380,demo-etcd-2=http://10.240.0.2:2380,demo-etcd-3=http://10.240.0.3:2380,node4=http://10.240.0.4:2380
  initial_cluster_state: existing
-
Check whether the new member node is up and running:
$ etcdctl cluster-health
member 9bf1b35fc7761a23 is healthy: got healthy result from http://10.240.0.4:2379
cluster is healthy
If your cluster reports a healthy state, etcd has successfully written the cluster configuration into the /var/lib/etcd directory.
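Shrinking the cluster works the same way in reverse: remove the member by its ID and then decommission the node. This uses the same subcommand that appears in the failed-member recovery steps below, shown here with the example member added above:
$ etcdctl member remove 9bf1b35fc7761a23
Removed member 9bf1b35fc7761a23 from cluster
After removal, stop the etcd-member service on the decommissioned host so it does not attempt to rejoin.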
An etcd member node might fail for several reasons: out of disk space, an incorrect reboot, or issues on the underlying system. This section provides instructions on how to recover a failed etcd member.
Consider a scenario where a member has failed in a three-member cluster. The cluster is still running and has maintained quorum. The example assumes a Container Linux Config with the following default options:
etcd:
  name: demo-etcd-1
  listen_client_urls: http://10.240.0.1:2379,http://0.0.0.0:4001
  advertise_client_urls: http://10.240.0.1:2379
  listen_peer_urls: http://0.0.0.0:2380
  initial_advertise_peer_urls: http://10.240.0.1:2380
  initial_cluster: demo-etcd-1=http://10.240.0.1:2380,demo-etcd-2=http://10.240.0.2:2380,demo-etcd-3=http://10.240.0.3:2380
  initial_cluster_token: demo-etcd-token
  initial_cluster_state: new
If the etcd cluster is protected with TLS, use https:// instead of http:// in the examples below.
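With TLS enabled, etcdctl also needs the client certificate material. A minimal sketch; the certificate paths here are placeholders for this example and should be adjusted to your deployment:
$ etcdctl --ca-file /etc/ssl/etcd/ca.pem \
    --cert-file /etc/ssl/etcd/client.pem \
    --key-file /etc/ssl/etcd/client-key.pem \
    --endpoints https://10.240.0.1:2379 \
    cluster-health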
Assume that the given etcd cluster has a faulty member, 10.240.0.2:
$ etcdctl cluster-health
member fe2f75dd51fa5ff is healthy: got healthy result from http://10.240.0.1:2379
failed to check the health of member 1609b5a3a078c227 on http://10.240.0.2:2379: Get http://10.240.0.2:2379/health: dial tcp 10.240.0.2:2379: connection refused
member 1609b5a3a078c227 is unreachable: [http://10.240.0.2:2379] are all unreachable
member 60e8a32b09dc91f1 is healthy: got healthy result from http://10.240.0.3:2379
cluster is healthy
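If it is not obvious which member ID belongs to the failed node, etcdctl member list maps each ID to its name and URLs. Illustrative output, reusing the IDs from the health check above:
$ etcdctl member list
fe2f75dd51fa5ff: name=demo-etcd-1 peerURLs=http://10.240.0.1:2380 clientURLs=http://10.240.0.1:2379
1609b5a3a078c227: name=demo-etcd-2 peerURLs=http://10.240.0.2:2380 clientURLs=http://10.240.0.2:2379
60e8a32b09dc91f1: name=demo-etcd-3 peerURLs=http://10.240.0.3:2380 clientURLs=http://10.240.0.3:2379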
-
Remove the failed member 1609b5a3a078c227 from the etcd cluster:
$ etcdctl member remove 1609b5a3a078c227
Removed member 1609b5a3a078c227 from cluster
The remove subcommand informs all other cluster nodes that a human has determined this node is dead and not available for connections.
-
Stop the etcd-member service on the failed node (10.240.0.2):
$ sudo systemctl stop etcd-member.service
-
Reinitialize the failed member.
$ etcdctl member add demo-etcd-2 http://10.240.0.2:2380
Added member named demo-etcd-2 with ID 4fb77509779cac99 to cluster

ETCD_NAME="demo-etcd-2"
ETCD_INITIAL_CLUSTER="demo-etcd-1=http://10.240.0.1:2380,demo-etcd-2=http://10.240.0.2:2380,demo-etcd-3=http://10.240.0.3:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
-
Modify the existing systemd drop-in, /etc/systemd/system/etcd-member.service.d/20-clct-etcd-member.conf, by replacing the node data with the appropriate information from the output of the etcdctl member add command executed in the previous step, so that it corresponds to the following Container Linux Config (see the sketch after this list for clearing the old data directory and restarting the service):
etcd:
  name: demo-etcd-2
  listen_client_urls: http://10.240.0.2:2379,http://0.0.0.0:4001
  advertise_client_urls: http://10.240.0.2:2379
  listen_peer_urls: http://0.0.0.0:2380
  initial_advertise_peer_urls: http://10.240.0.2:2380
  initial_cluster: demo-etcd-1=http://10.240.0.1:2380,demo-etcd-2=http://10.240.0.2:2380,demo-etcd-3=http://10.240.0.3:2380
  initial_cluster_token: demo-etcd-token
  initial_cluster_state: existing
-
Check the cluster health:
$ etcdctl cluster-health
member 4fb77509779cac99 is healthy: got healthy result from http://10.240.0.2:2379
cluster is healthy
If your cluster reports a healthy state, etcd has successfully written the cluster configuration into the /var/lib/etcd directory.
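Before re-checking health, the failed node has to come back up: its stale data directory should be cleared first (the old directory still identifies the node as the removed member, so etcd would refuse to rejoin with it), systemd reloaded so the edited drop-in takes effect, and the service started. A minimal sketch, run on 10.240.0.2:
$ sudo rm -rf /var/lib/etcd
$ sudo systemctl daemon-reload
$ sudo systemctl start etcd-member.service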
-
Download etcdctl from the etcd releases page and install it, for example, into /opt/bin.
-
Create a backup directory:
$ sudo mkdir /var/lib/etcd_backup
-
Save a snapshot of the database to /var/lib/etcd_backup/backup.db (a way to verify the snapshot is sketched after these steps):
$ sudo ETCDCTL_API=3 /opt/bin/etcdctl snapshot save /var/lib/etcd_backup/backup.db
-
Restore the snapshot file into a new member directory /var/lib/etcd_backup/etcd:
$ sudo ETCDCTL_API=3 /opt/bin/etcdctl snapshot restore /var/lib/etcd_backup/backup.db \
    --data-dir /var/lib/etcd_backup/etcd \
    --name new-demo-etcd-1 \
    --initial-cluster new-demo-etcd-1=http://10.240.0.1:2380 \
    --initial-cluster-token new-etcd-cluster-1 \
    --initial-advertise-peer-urls http://10.240.0.1:2380
-
Remove the obsolete directory:
$ sudo rm -rf /var/lib/etcd
-
Move the restored member directory to /var/lib/etcd:
$ sudo mv /var/lib/etcd_backup/etcd /var/lib/
-
Set the etcd user permissions:
$ sudo chown etcd -R /var/lib/etcd
-
Start the etcd member service:
$ sudo systemctl start etcd-member.service
-
Check the node health:
$ etcdctl cluster-health
-
The restored cluster is now running with a single node. For information on adding more nodes, see Change etcd cluster size.
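The snapshot saved earlier can also be inspected before (or after) restoring it, as a sanity check that the file is a valid etcd v3 snapshot; the command prints the snapshot's hash, revision, total key count, and size, and the values will differ per cluster:
$ sudo ETCDCTL_API=3 /opt/bin/etcdctl snapshot status /var/lib/etcd_backup/backup.db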
If a cluster is totally broken and quorum cannot be restored, all etcd members must be reconfigured from scratch. This procedure consists of two steps:
- Initialize a one-member etcd cluster using the initial data directory
- Resize this etcd cluster by adding new etcd members, following the steps in the change etcd cluster size section.
This document is an adaptation for Container Linux of the official etcd disaster recovery guide, and uses systemd drop-ins for convenience.
Consider a three-node cluster with two permanently lost members.
-
Stop the etcd-member service on all the members:
$ sudo systemctl stop etcd-member.service
If you have etcd proxy nodes, they should update their member lists automatically according to the --proxy-refresh-interval configuration option.
-
On one of the member nodes, run the following command to back up the current data directory:
$ sudo etcdctl backup --data-dir /var/lib/etcd --backup-dir /var/lib/etcd_backup
Now that a backup has been created, start a single-member cluster.
-
Create the /run/systemd/system/etcd-member.service.d/98-force-new-cluster.conf drop-in file with the following contents:
[Service]
Environment="ETCD_FORCE_NEW_CLUSTER=true"
-
Run sudo systemctl daemon-reload.
-
Check whether the new drop-in is valid by looking in its journal for errors:
$ sudo journalctl _PID=1 -e -u etcd-member.service
-
If everything is OK, start the etcd-member daemon:
$ sudo systemctl start etcd-member.service
-
Check the cluster state:
$ etcdctl member list
e6c2bda2aa1f2dcf: name=1be6686cc2c842db035fdc21f56d1ad0 peerURLs=http://10.240.1.2:2380 clientURLs=http://10.240.1.2:2379
$ etcdctl cluster-health
member e6c2bda2aa1f2dcf is healthy: got healthy result from http://10.240.1.2:2379
cluster is healthy
-
If the output contains no errors, remove the 98-force-new-cluster.conf drop-in file:
$ sudo rm -rf /run/systemd/system/etcd-member.service.d/98-force-new-cluster.conf
-
Reload systemd services:
$ sudo systemctl daemon-reload
It is not necessary to restart the etcd-member service after reloading the systemd services.
-
Spin up new nodes. Ensure that the etcd version is specified in the Container Linux Config. For information on adding more nodes, see Change etcd cluster size.
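An illustrative config for one new node joining the restored cluster, assuming etcd version 3.2.11, a new host at 10.240.1.3, and the restored member's name and peer URL taken from the etcdctl member list output above (substitute the version, names, and IPs from your own etcdctl member add output):
etcd:
  version: 3.2.11
  name: demo-etcd-2
  listen_client_urls: http://10.240.1.3:2379,http://0.0.0.0:4001
  advertise_client_urls: http://10.240.1.3:2379
  listen_peer_urls: http://0.0.0.0:2380
  initial_advertise_peer_urls: http://10.240.1.3:2380
  initial_cluster: 1be6686cc2c842db035fdc21f56d1ad0=http://10.240.1.2:2380,demo-etcd-2=http://10.240.1.3:2380
  initial_cluster_state: existing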