
Add method (*EtcdServer) IsRaftLoopBlocked to support checking whether the raft loop is blocked #16710

Draft: ahrtr wants to merge 1 commit into main from check_raftloop_20231009

Conversation

@ahrtr (Member) commented Oct 8, 2023

@ahrtr force-pushed the check_raftloop_20231009 branch 3 times, most recently from a398af0 to b90aef1 on October 8, 2023 20:05

Commit: Add method (*EtcdServer) IsRaftLoopBlocked to support checking whether the raft loop is blocked
Signed-off-by: Benjamin Wang <wachao@vmware.com>

@ahrtr force-pushed the check_raftloop_20231009 branch from b90aef1 to fc7902a on October 8, 2023 20:11
@ahrtr (Member, Author) commented Oct 8, 2023

cc @serathius

@chaochn47 (Member) commented Oct 9, 2023

Copied from the design doc comment:

A counter could be added to the raftNode.tick function [1], and the prober could just look up the counter to decide whether the raft loop is deadlocked.

[1] https://github.com/etcd-io/etcd/blob/aa97484166d2b3fb6afeb4390344e68b02afb566/server/etcdserver/raft.go#L155-L159

Since a raft loop deadlock will block the next select statement execution, I can see two approaches:

  1. The prober sends a request to the etcd server and waits for a response, with a configurable waiting timeout.
  2. The prober queries the etcd server for whether, in the past x seconds, there has been at least one select statement execution (i.e. the raft tick timer fired); the server sends the response back immediately.

With the goal of the prober check fitting in the 1s timeout, it looks like the 2nd approach is better. What do you think? @ahrtr
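
For concreteness, here is a minimal sketch of the counter idea being discussed (all names are illustrative, not the actual etcd code): the raft loop bumps an atomic counter whenever its tick case runs, and a prober compares two readings taken some interval apart.

```go
package raftcheck

import "sync/atomic"

// raftNode stands in for etcd's raftNode; tickCount is the illustrative counter.
type raftNode struct {
	tickCount atomic.Uint64
}

// tick would be invoked from the raft loop's select statement each time the
// ticker fires, so the counter only advances while the loop is making progress.
func (r *raftNode) tick() {
	r.tickCount.Add(1)
	// ... the existing tick work would go here ...
}

// TickCount lets a prober read the counter without touching the raft loop itself.
func (r *raftNode) TickCount() uint64 {
	return r.tickCount.Load()
}
```

A prober would remember its previous reading; if two consecutive readings are equal, no tick case ran in between, which hints that the loop is stuck (or merely slow, which is the concern raised further down in this thread).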

@chaochn47 (Member) commented:

cc @siyuanfoundation ^

@ahrtr (Member, Author) commented Oct 9, 2023

> 2. The prober queries the etcd server for whether, in the past x seconds, there has been at least one select statement execution (i.e. the raft tick timer fired); the server sends the response back immediately.

This means that you need to remember the timestamp of the last tick and check it in the liveness probe, something like `time.Since(lastTick) > timeout`. You will get an immediate result, but it will be affected by clock drift. I would suggest avoiding this approach.

Note that this PR just provides the basic functionality for checking whether the raft loop is blocked. You can check it asynchronously.
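
For illustration only (this is a hypothetical sketch, not the code in this PR), one way such an asynchronous check could look: the probe hands the raft loop a no-op over an unbuffered channel and reports the loop as blocked if nothing accepts the send within a timeout.

```go
package raftcheck

import "time"

// raftLoopProbe is an illustrative stand-in. probeC is assumed to be an
// unbuffered channel that the raft loop receives from in its main select
// statement, so a send only completes when the loop runs an iteration.
type raftLoopProbe struct {
	probeC chan struct{}
}

// IsBlocked reports whether the raft loop failed to pick up a no-op probe
// within the given timeout. Because it may wait up to `timeout`, callers
// would run it asynchronously rather than inside a latency-sensitive handler.
func (p *raftLoopProbe) IsBlocked(timeout time.Duration) bool {
	t := time.NewTimer(timeout)
	defer t.Stop()
	select {
	case p.probeC <- struct{}{}:
		return false // the loop drained the probe, so it is making progress
	case <-t.C:
		return true // nothing received the probe within the timeout
	}
}
```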

@chaochn47 (Member) commented:

> This means that you need to remember the timestamp of the last tick and check it in the liveness probe, something like `time.Since(lastTick) > timeout`. You will get an immediate result, but it will be affected by clock drift. I would suggest avoiding this approach.

It's not necessary; I've sent a draft PR to demonstrate: #16713.

> Note that this PR just provides the basic functionality for checking whether the raft loop is blocked. You can check it asynchronously.

Yeah, assuming "async" here means: try to send to the dummy channel, go ahead with the remaining checks in the prober, and then validate that the earlier send has completed. It seems complicated compared with the counter approach. WDYT?

@siyuanfoundation (Contributor) commented:

> Copied from the design doc comment:
>
> A counter could be added to the raftNode.tick function [1], and the prober could just look up the counter to decide whether the raft loop is deadlocked.
>
> [1] https://github.com/etcd-io/etcd/blob/aa97484166d2b3fb6afeb4390344e68b02afb566/server/etcdserver/raft.go#L155-L159
>
> Since a raft loop deadlock will block the next select statement execution, I can see two approaches:
>
>   1. The prober sends a request to the etcd server and waits for a response, with a configurable waiting timeout.
>   2. The prober queries the etcd server for whether, in the past x seconds, there has been at least one select statement execution (i.e. the raft tick timer fired); the server sends the response back immediately.
>
> With the goal of the prober check fitting in the 1s timeout, it looks like the 2nd approach is better. What do you think? @ahrtr

I think either way, the interval between checks could be too short to differentiate a slow ready process from a deadlock. I suggest using another, longer ticker in #16713 to reset the count, instead of resetting it based on probes.
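
A rough sketch of that "longer reset ticker" idea, again with invented names: a background goroutine, driven by its own coarser ticker, records once per window whether the tick counter moved; the prober then just reads the latest verdict and returns immediately, independent of how often the probe itself fires.

```go
package raftcheck

import (
	"sync/atomic"
	"time"
)

type raftNode struct {
	tickCount atomic.Uint64 // bumped by the raft loop on every tick, as in the earlier sketch
	blocked   atomic.Bool   // verdict from the most recent window
}

// startBlockDetector samples the tick counter on its own, longer interval.
// If an entire window passes without the counter moving, the loop is flagged
// as blocked until a later window observes progress again.
func (r *raftNode) startBlockDetector(window time.Duration, done <-chan struct{}) {
	go func() {
		t := time.NewTicker(window)
		defer t.Stop()
		prev := r.tickCount.Load()
		for {
			select {
			case <-t.C:
				cur := r.tickCount.Load()
				r.blocked.Store(cur == prev)
				prev = cur
			case <-done:
				return
			}
		}
	}()
}

// Blocked is what a liveness handler would read; it never waits, so it fits
// inside a tight probe timeout regardless of the window length.
func (r *raftNode) Blocked() bool {
	return r.blocked.Load()
}
```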

@chaochn47 (Member) commented:

> I think either way, the interval between checks could be too short to differentiate a slow ready process from a deadlock.

In the second approach, that is determined by the administrator case by case. `prober interval * failure threshold` should be a sane value based on the administrator's judgement, e.g. depending on whether they are using a network-attached volume like EBS or a physically attached SSD.

> I suggest using another, longer ticker in #16713 to reset the count, instead of resetting it based on probes.

Adding another ticker may not be optimal. How do you plan to set the new ticker interval, and is it configurable? It may make the etcd setup more complicated.
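
To make the `prober interval * failure threshold` point concrete (numbers purely illustrative): with a 10s probe interval and a failure threshold of 3, the raft loop would have to look stuck for roughly 30s before the prober declares it dead, and an administrator running on a network-attached volume such as EBS might deliberately choose a larger window than one running on local SSD.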

@ahrtr marked this pull request as a draft on October 10, 2023 11:55

stale bot commented Mar 17, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on Mar 17, 2024