
Add method (*EtcdServer) IsRaftLoopBlocked to support checking whether the raft loop is blocked #16710

Draft: ahrtr wants to merge 1 commit into main from check_raftloop_20231009

Conversation

@ahrtr (Member) commented Oct 8, 2023

@ahrtr force-pushed the check_raftloop_20231009 branch 3 times, most recently from a398af0 to b90aef1 on October 8, 2023 20:05

Commit: Add method (*EtcdServer) IsRaftLoopBlocked to support checking whether the raft loop is blocked
Signed-off-by: Benjamin Wang <wachao@vmware.com>

@ahrtr force-pushed the check_raftloop_20231009 branch from b90aef1 to fc7902a on October 8, 2023 20:11
@ahrtr (Member, Author) commented Oct 8, 2023

cc @serathius

@chaochn47 (Member) commented Oct 9, 2023

Copied from the design doc comment:

A counter could be added to the raftNode.tick function [1], and the prober could just look up the counter to decide whether the raft loop is deadlocked.

[1] https://github.com/etcd-io/etcd/blob/aa97484166d2b3fb6afeb4390344e68b02afb566/server/etcdserver/raft.go#L155-L159

Since a raft loop deadlock will block the next select statement execution, I can see two approaches:

  1. The prober sends a request to the etcd server and waits for a response, with a configurable waiting timeout.
  2. The prober queries the etcd server for whether, in the past x seconds, there has been at least one select statement execution (i.e. the raft tick timer fired); the server sends the response back immediately.

With the goal of the prober check fitting in the 1s timeout, it looks like the 2nd approach is better. What do you think? @ahrtr
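
For concreteness, here is a minimal sketch of the counter idea being discussed (all names are illustrative, not the actual etcd code): the raft loop bumps an atomic counter whenever its tick case runs, and a prober compares two readings taken some interval apart.

```go
package raftcheck

import "sync/atomic"

// raftNode stands in for etcd's raftNode; tickCount is the illustrative counter.
type raftNode struct {
	tickCount atomic.Uint64
}

// tick would be invoked from the raft loop's select statement each time the
// ticker fires, so the counter only advances while the loop is making progress.
func (r *raftNode) tick() {
	r.tickCount.Add(1)
	// ... the existing tick work would go here ...
}

// TickCount lets a prober read the counter without touching the raft loop itself.
func (r *raftNode) TickCount() uint64 {
	return r.tickCount.Load()
}
```

A prober would remember its previous reading; if two consecutive readings are equal, no tick case ran in between, which hints that the loop is stuck (or merely slow, which is the concern raised further down in this thread).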

@chaochn47 (Member) commented:

cc @siyuanfoundation ^

@ahrtr (Member, Author) commented Oct 9, 2023

> 2. The prober queries the etcd server for whether, in the past x seconds, there has been at least one select statement execution (i.e. the raft tick timer fired); the server sends the response back immediately.

This means that you need to remember the timestamp of the last tick and check it in the liveness probe, something like `time.Since(lastTick) > timeout`. You will get an immediate result, but it will be affected by clock drift. I would suggest avoiding this approach.

Note that this PR just provides the basic functionality for checking whether the raft loop is blocked. You can check it asynchronously.
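
For illustration only (this is a hypothetical sketch, not the code in this PR), one way such an asynchronous check could look: the probe hands the raft loop a no-op over an unbuffered channel and reports the loop as blocked if nothing accepts the send within a timeout.

```go
package raftcheck

import "time"

// raftLoopProbe is an illustrative stand-in. probeC is assumed to be an
// unbuffered channel that the raft loop receives from in its main select
// statement, so a send only completes when the loop runs an iteration.
type raftLoopProbe struct {
	probeC chan struct{}
}

// IsBlocked reports whether the raft loop failed to pick up a no-op probe
// within the given timeout. Because it may wait up to `timeout`, callers
// would run it asynchronously rather than inside a latency-sensitive handler.
func (p *raftLoopProbe) IsBlocked(timeout time.Duration) bool {
	t := time.NewTimer(timeout)
	defer t.Stop()
	select {
	case p.probeC <- struct{}{}:
		return false // the loop drained the probe, so it is making progress
	case <-t.C:
		return true // nothing received the probe within the timeout
	}
}
```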

@chaochn47 (Member) commented:

> This means that you need to remember the timestamp of the last tick and check it in the liveness probe, something like `time.Since(lastTick) > timeout`. You will get an immediate result, but it will be affected by clock drift. I would suggest avoiding this approach.

It's not necessary; I've sent a draft PR to demonstrate: #16713.

> Note that this PR just provides the basic functionality for checking whether the raft loop is blocked. You can check it asynchronously.

Yeah, assuming "async" here means: try to send to the dummy channel, go ahead with the remaining checks in the prober, and then validate that the earlier send has completed. It seems complicated compared with the counter approach. WDYT?

@siyuanfoundation (Contributor) commented:

> Copied from the design doc comment:
>
> A counter could be added to the raftNode.tick function [1], and the prober could just look up the counter to decide whether the raft loop is deadlocked.
>
> [1] https://github.com/etcd-io/etcd/blob/aa97484166d2b3fb6afeb4390344e68b02afb566/server/etcdserver/raft.go#L155-L159
>
> Since a raft loop deadlock will block the next select statement execution, I can see two approaches:
>
>   1. The prober sends a request to the etcd server and waits for a response, with a configurable waiting timeout.
>   2. The prober queries the etcd server for whether, in the past x seconds, there has been at least one select statement execution (i.e. the raft tick timer fired); the server sends the response back immediately.
>
> With the goal of the prober check fitting in the 1s timeout, it looks like the 2nd approach is better. What do you think? @ahrtr

I think either way, the interval between checks could be too short to differentiate a slow ready process from a deadlock. I suggest using another, longer ticker in #16713 to reset the count, instead of resetting it based on probes.
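
A rough sketch of that "longer reset ticker" idea, again with invented names: a background goroutine, driven by its own coarser ticker, records once per window whether the tick counter moved; the prober then just reads the latest verdict and returns immediately, independent of how often the probe itself fires.

```go
package raftcheck

import (
	"sync/atomic"
	"time"
)

type raftNode struct {
	tickCount atomic.Uint64 // bumped by the raft loop on every tick, as in the earlier sketch
	blocked   atomic.Bool   // verdict from the most recent window
}

// startBlockDetector samples the tick counter on its own, longer interval.
// If an entire window passes without the counter moving, the loop is flagged
// as blocked until a later window observes progress again.
func (r *raftNode) startBlockDetector(window time.Duration, done <-chan struct{}) {
	go func() {
		t := time.NewTicker(window)
		defer t.Stop()
		prev := r.tickCount.Load()
		for {
			select {
			case <-t.C:
				cur := r.tickCount.Load()
				r.blocked.Store(cur == prev)
				prev = cur
			case <-done:
				return
			}
		}
	}()
}

// Blocked is what a liveness handler would read; it never waits, so it fits
// inside a tight probe timeout regardless of the window length.
func (r *raftNode) Blocked() bool {
	return r.blocked.Load()
}
```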

@chaochn47 (Member) commented:

> I think either way, the interval between checks could be too short to differentiate a slow ready process from a deadlock.

In the second approach, that is determined by the administrator case by case. `prober interval * failure threshold` should be a sane value based on the administrator's judgement, e.g. depending on whether they are using a network-attached volume like EBS or a physically attached SSD.

> I suggest using another, longer ticker in #16713 to reset the count, instead of resetting it based on probes.

Adding another ticker may not be optimal. How do you plan to set the new ticker interval, and is it configurable? It may make the etcd setup more complicated.
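
To make the `prober interval * failure threshold` point concrete (numbers purely illustrative): with a 10s probe interval and a failure threshold of 3, the raft loop would have to look stuck for roughly 30s before the prober declares it dead, and an administrator running on a network-attached volume such as EBS might deliberately choose a larger window than one running on local SSD.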

@ahrtr marked this pull request as a draft on October 10, 2023 11:55

stale bot commented Mar 17, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on Mar 17, 2024