Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

concurrency.NewSession hang after etcd server is killed with SIGSTOP(19) #14631

Open
haojinming opened this issue Oct 26, 2022 · 13 comments · May be fixed by #16385
Open

concurrency.NewSession hang after etcd server is killed with SIGSTOP(19) #14631

haojinming opened this issue Oct 26, 2022 · 13 comments · May be fixed by #16385

Comments

@haojinming
Copy link

haojinming commented Oct 26, 2022

What happened?

concurrency.NewSession hang after etcd server is kill by SIGSTOP(19)

What did you expect to happen?

NewSession can return error after server is killed.

How can we reproduce it (as minimally and precisely as possible)?

  1. start three or more etcd server nodes.
  2. run main with following codes.
  3. kill -19 pidof etcd leader
package main

import (
	"fmt"
	"time"

	"github.com/pingcap/log"
	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
	"go.uber.org/zap"
)

func initEtcdClient() *clientv3.Client {
	var client *clientv3.Client
	var err error
	endpoints := []string{"172.16.5.32:2379", "172.16.5.32:2382", "172.16.5.32:2384"}
	client, err = clientv3.New(clientv3.Config{
		Endpoints:            endpoints,
		DialTimeout:          5 * time.Second,
		DialKeepAliveTimeout: 5 * time.Second,
	})
	if err != nil {
		fmt.Printf("create client fail:%v\\n", err)
		log.Panic("create client fail", zap.Error(err))
	}
	return client
}

func main() {

	number := 0
	client := initEtcdClient()
	for {
		log.Info("create session begin.", zap.Int("time", number))
		s, err := concurrency.NewSession(client)
		if err != nil {
			log.Panic("create client fail", zap.Error(err))
		}
		log.Info("create session finish.", zap.Int("time", number))
		s.Close()
		number++
		time.Sleep(time.Second)
	}
}

Anything else we need to know?

If re-create etcd client after kill -19, it can return error. However, in our application, the client is created at the beginning and stored to use in the while lifecycle of the application.

Etcd version (please run commands below)

go.etcd.io/etcd/client/v3 v3.5.5 ```console $ etcd --version # paste output here

$ etcdctl version

paste output here


</details>


### Etcd configuration (command line flags or environment variables)

<details>

# paste your configuration here

</details>


### Etcd debug information (please run commands blow, feel free to obfuscate the IP address or FQDN in the output)

<details>

```console
$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

No response

@haojinming haojinming changed the title concurrency.NewSession hang after etcd server is kill by SIGSTOP(19) concurrency.NewSession hang after etcd server is killed with SIGSTOP(19) Oct 27, 2022
@haojinming
Copy link
Author

If kill etcd leader with SIGKILL(9), the following error log will print and NewSession can continue acting after new leader is elected.

{"level":"warn","ts":"2022-10-27T10:55:32.940+0800","logger":"etcd-client","caller":"v3@v3.5.5/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000257340/172.16.5.32:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}

I think it should has same behaviour when kill with SIGSTOP, isn't it?

@zeminzhou
Copy link

@ahrtr How about add a configuration item that controls the timeout of grpc function calls when create the etcd client?

@MatteoGioioso
Copy link

MatteoGioioso commented Nov 12, 2022

I might be completely wrong here, but I think the cause of this behavior is this option: defaultWaitForReady = grpc.WaitForReady(true) in /go.etcd.io/etcd/client/v3@v3.5.5/options.go;
since is not accessible I have tried to set it to false by temporarily modify the file and in my case it does not hang anymore.

@zeminzhou
Copy link

Thanks, I changed grpc.WaitForReady to false, it works. But I don't understand why grpc.WaitForReady is hardcoded to true.

@MatteoGioioso
Copy link

I have no idea, I have found an old issue that proposed to expose some of those options, but it went on stale.

Here is the issue: #13344

@halegreen
Copy link
Contributor

SIGSTOP(19) of the etcd leader, new leader won't be selected ?

@haojinming
Copy link
Author

SIGSTOP(19) of the etcd leader, new leader won't be selected ?

Yes, the main program hang without any log and action.

@huangjiao-heart
Copy link

huangjiao-heart commented Feb 22, 2023

Thanks, I changed grpc.WaitForReady to false, it works. But I don't understand why grpc.WaitForReady is hardcoded to true.
@zeminzhou
I also encountered the same problem, How to change this value in our code?

@MatteoGioioso
Copy link

MatteoGioioso commented Feb 22, 2023

I also encountered the same problem, How to change this value in our code?

@huangjiao-heart unfortunately you can't as it is a private variable, you have to change it from the package.

To be more clear, you have to literally open the file where the variable is hardcoded and change it manually. Or either make a fork and change it from there.

@AngstyDuck
Copy link
Contributor

I'm still quite new to this, but adding a timeout to the LeaseGrant request seem to raise the appropriate error when the servers are killed. Would there be any negative repercussions if we do so?

@fuweid
Copy link
Member

fuweid commented Jul 20, 2023

There are two cases:

  1. If the ETCD server has been paused by SIGSTOP, the client will wait for the connection ready because of WaitForReady. It can be solved by exporting option.

  2. If the ETCD server is paused after connection ready, the client will wait for http2 response forever. It should be handled by ctx option when you new session.

I'm not sure that reason about pausing the process. It's freezing. It will impact all the ready connections.
If you want to stop the server, you should use SIGTERM or SIGKILL.

@AngstyDuck
Copy link
Contributor

AngstyDuck commented Jul 23, 2023

If you want to stop the server, you should use SIGTERM or SIGKILL.

This bug can be replicated if the server is stopped by SIGTERM or SIGKILL as well.

The freeze occurs during the creation of a new session, where client attempts to send a LeaseGrant request to the server. Because the grpc option WaitForReady is set to true, the creation of the RPC is queued until the connection is ready, without returning any errors.

I understand from the provided documentation that this option is set to minimise error responses from transient failures. In that case, I'd like to propose setting a timeout to the LeaseGrant gRPC request specifically during the creation of new sessions. I'm thinking of exposing the timeout duration as a configurable value (I'm thinking of naming it as SessionCreationTimeout for now) in the Config object in etcd/client/v3/config.go. Do let me know if this has any negative side effects, otherwise I'll be preparing a PR for this over the next few days 👍

@fuweid
Copy link
Member

fuweid commented Jul 24, 2023

@AngstyDuck Yes. gRPC has a background goroutine picker to collect available connection to new transporter. I think it's good option to user if they want to disable waitForReady.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging a pull request may close this issue.

8 participants