concurrent.Session closed immediately after creation #13675
Comments
Thanks @liberize for the finding. It's a real issue, and I reproduced it in my local environment on both 3.5.2 and the main branch. It took me a couple of hours to figure out the root cause: the leader hasn't finished applying the LeaseGrantRequest when it receives the following LeaseKeepAliveRequest. The rough workflow is something like the following (assuming there are 3 members in the cluster):
Since the 3 members are all running on one local server, the communication between members is really fast. The leader finishes applying the LeaseGrantRequest only about 1-2 microseconds after it receives the LeaseKeepAliveRequest. This rarely happens in a real production environment, but it is indeed a real issue, and I think it is a generic one: when etcdserver processes a message coming from raft, it applies the change locally in parallel with replicating it to the followers (see L212-L224). So in theory it's possible that the leader hasn't finished applying the request before the client receives a SUCCESS response and sends the next request. Any thoughts on the solution, or do you think there is no need to fix this since it is unlikely to be reproduced in a real production environment? cc @ptabor @serathius @spzala @xiang90 @gyuho
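To make the ordering concrete, here is a minimal sketch of the client-side call sequence that can hit this race. It uses the standard clientv3 API; the helper name and TTL value are illustrative, and the timing comments describe the assumption above, not guaranteed behavior.

```go
import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// grantThenKeepAlive is a hypothetical helper showing the two requests
// involved in the race; cli is assumed to point at the 3-member cluster.
func grantThenKeepAlive(ctx context.Context, cli *clientv3.Client) error {
	// LeaseGrantRequest: returns once the entry is committed by quorum.
	resp, err := cli.Grant(ctx, 5)
	if err != nil {
		return err
	}
	// The leader applies the grant locally in parallel with replicating it,
	// so at this instant its lessor may not know the lease yet. An immediate
	// keep-alive can then be answered with TTL = 0, which the client treats
	// as "lease expired".
	_, err = cli.KeepAliveOnce(ctx, resp.ID) // LeaseKeepAliveRequest
	return err
}
```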
Fixed by #13690.
Let's keep this issue open until the PR is reviewed & merged.
The fix was merged and will be released in v3.5.3.
What happened?
I have an etcd cluster of 3 nodes. Each node is started like this:
Then I used this code to create a lot of clients:
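The exact snippet from the report wasn't preserved in this copy; the following is a minimal Go sketch matching the described behavior. The endpoint (172.25.21.36:2379, the leader address mentioned below) and the timeout values are assumptions; each client opens a concurrency.Session and panics if it is closed right after creation.

```go
package main

import (
	"fmt"
	"sync"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ { // the report says it panics with 10-100 clients
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			cli, err := clientv3.New(clientv3.Config{
				Endpoints:   []string{"172.25.21.36:2379"}, // assumed endpoint
				DialTimeout: 5 * time.Second,
			})
			if err != nil {
				panic(err)
			}
			defer cli.Close()

			sess, err := concurrency.NewSession(cli)
			if err != nil {
				panic(err)
			}
			defer sess.Close()

			select {
			case <-sess.Done():
				// keep-alive channel closed right after creation: the bug
				panic(fmt.Sprintf("client %d: session closed immediately", id))
			case <-time.After(10 * time.Second):
				// session stayed alive long enough; treat as success
			}
		}(i)
	}
	wg.Wait()
}
```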
It panics every time with only 10-100 clients:
What did you expect to happen?
The code above runs without problem.
How can we reproduce it (as minimally and precisely as possible)?
Use the code pasted above.
Anything else we need to know?
It seems the server returned a keep-alive response with TTL = 0, and the channel returned by Client.KeepAlive was then closed:
https://github.com/etcd-io/etcd/blob/main/client/v3/lease.go#L512
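For reference, the relevant client behavior at the linked line is roughly the following. This is a paraphrase, not a verbatim excerpt; field and method names are approximate.

```go
// Inside the lessor's keep-alive response handling (client/v3/lease.go):
if karesp.TTL <= 0 {
	// The server no longer knows the lease (here, the leader's lessor has
	// not applied the grant yet), so the client treats the lease as expired
	// and closes all keep-alive channels for it. That closed channel is
	// what makes the concurrency.Session appear "closed immediately".
	delete(l.keepAlives, karesp.ID)
	ka.close()
	return
}
```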
Wireshark confirms this:
172.25.21.36 is the leader node:
Seems like a consistency issue?
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
node-0:
node-1:
node-2:
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output
No response