Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcdserver: add learner metrics #10731

Merged
merged 1 commit into from
Jun 9, 2019
Merged

Conversation

WIZARD-CXY
Copy link
Contributor

@WIZARD-CXY WIZARD-CXY commented May 16, 2019

@jingyih @xiang90 Add learner metrics.
I suppose we will get this etcd_debugging_learner_promote_failures and etcd_debugging_learner_promote_success from leader.

etcd_server_is_learner

# name: "etcd_debugging_learner_promote_failures"
# description: "The total number of learner promote failures (likely learner not ready)."
Copy link
Contributor

@jingyih jingyih May 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this counter will only get updated if the local member is leader, right? Consider a user sending promote requests to a server which is not the leader at the moment. Then the user checks metrics, and sees 0 promote failures and 0 promote success. Feels like the description needs to be clearer to avoid user confusion. Maybe 'The total number of learner promotion failures while the server is leader'?

Also there are 3 possible reasons that a promote could fail, maybe we should not suggest it is likely to be one of them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get your point. user can use prometheus sum query to get the total number of the failures. As a debugging metrics I think it is fine add failure reasons as the metrics label. @jingyih

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.

@WIZARD-CXY
Copy link
Contributor Author

@jingyih seems I need part3 to be added #10730 in the upstream.Now the relating code is not in the master yet.

@jingyih
Copy link
Contributor

jingyih commented May 17, 2019

@jingyih seems I need part3 to be added #10730 in the upstream.Now the relating code is not in the master yet.

Maybe just wait for #10730? Hopefully it will get merged soon. BTW KubenCon is starting soon, it might cause some delay in reviews.

@WIZARD-CXY
Copy link
Contributor Author

@jingyih @xiang90 kindly ping

@gyuho
Copy link
Contributor

gyuho commented May 28, 2019

Is this still WIP? This PR only contains doc change.

add failure reasons as the metrics label

This will be useful. Can we update this PR with the actual code change?

@gyuho
Copy link
Contributor

gyuho commented May 28, 2019

We also need changelog (maybe @jingyih can collect all the changes at once in one PR).

@jingyih
Copy link
Contributor

jingyih commented May 28, 2019

Is this still WIP? This PR only contains doc change.

add failure reasons as the metrics label

This will be useful. Can we update this PR with the actual code change?

This is blocked by #10730.

@jingyih
Copy link
Contributor

jingyih commented May 28, 2019

We also need changelog (maybe @jingyih can collect all the changes at once in one PR).

I will update the changelog once #10730 is finalized and merged.

@WIZARD-CXY WIZARD-CXY changed the title etcdserver: add learner metrics [WIP]etcdserver: add learner metrics May 29, 2019
@jingyih
Copy link
Contributor

jingyih commented May 29, 2019

#10730 merged. Please update PR. @WIZARD-CXY

@WIZARD-CXY
Copy link
Contributor Author

@jingyih will do

@WIZARD-CXY WIZARD-CXY force-pushed the learner_metric branch 4 times, most recently from b6be3cd to 819a554 Compare June 3, 2019 03:43
@WIZARD-CXY
Copy link
Contributor Author

WIZARD-CXY commented Jun 3, 2019

@jingyih I check the code and find that it is hard to monitor whether the member is a learner or not real time. The code now is not ready for this yet. PTAL?

@jingyih
Copy link
Contributor

jingyih commented Jun 3, 2019

@jingyih I check the code and find that it is hard to monitor whether the member is a learner or not real time.

Can we update the isLearner metrics here:

case raftpb.ConfChangeAddNode, raftpb.ConfChangeAddLearnerNode:

Whenever there is an raft entry to add node / add learner node / promote node, and the entry's node ID is the same as local member ID, update learner status.

@WIZARD-CXY
Copy link
Contributor Author

@jingyih I check the code and find that it is hard to monitor whether the member is a learner or not real time.

Can we update the isLearner metrics here:

case raftpb.ConfChangeAddNode, raftpb.ConfChangeAddLearnerNode:

Whenever there is an raft entry to add node / add learner node / promote node, and the entry's node ID is the same as local member ID, update learner status.

good idea!

@WIZARD-CXY
Copy link
Contributor Author

WIZARD-CXY commented Jun 4, 2019

@jingyih ptal. Changed according to your suggestions and I tested on my local machine. I set up a 3node cluster and add one learner. below is part of merics result. test passed.

➜  ~ curl http://127.0.0.1:2379/metrics 2>&1 |grep learner
# HELP etcd_server_is_learner Whether or not this member is a learner. 1 if is, 0 otherwise.
# TYPE etcd_server_is_learner gauge
etcd_server_is_learner 0
# HELP etcd_server_learner_promote_successes The total number of successful learner promotions while this member is leader.
# TYPE etcd_server_learner_promote_successes counter
etcd_server_learner_promote_successes 0
➜  ~ curl http://127.0.0.1:23790/metrics 2>&1 |grep learner
# HELP etcd_server_is_learner Whether or not this member is a learner. 1 if is, 0 otherwise.
# TYPE etcd_server_is_learner gauge
etcd_server_is_learner 0
# HELP etcd_server_learner_promote_successes The total number of successful learner promotions while this member is leader.
# TYPE etcd_server_learner_promote_successes counter
etcd_server_learner_promote_successes 0
➜  ~ curl http://127.0.0.1:22379/metrics 2>&1 |grep learner 
# HELP etcd_server_is_learner Whether or not this member is a learner. 1 if is, 0 otherwise.
# TYPE etcd_server_is_learner gauge
etcd_server_is_learner 0
# HELP etcd_server_learner_promote_failures The total number of learner promote failures (likely learner not ready) while this member is leader.
# TYPE etcd_server_learner_promote_failures counter
etcd_server_learner_promote_failures{Reason="membership: ID not found"} 1
etcd_server_learner_promote_failures{Reason="membership: can only promote a learner member"} 1
# HELP etcd_server_learner_promote_successes The total number of successful learner promotions while this member is leader.
# TYPE etcd_server_learner_promote_successes counter
etcd_server_learner_promote_successes 1

@WIZARD-CXY WIZARD-CXY force-pushed the learner_metric branch 2 times, most recently from 4bb39ad to 86e168f Compare June 4, 2019 02:51
Copy link
Contributor

@jingyih jingyih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm after nits.

docs/metrics/latest Outdated Show resolved Hide resolved
@WIZARD-CXY
Copy link
Contributor Author

@jingyih updated ptal

etcdserver/metrics.go Outdated Show resolved Hide resolved
@jingyih
Copy link
Contributor

jingyih commented Jun 5, 2019

lgtm

@jingyih
Copy link
Contributor

jingyih commented Jun 5, 2019

Can you send another PR to add the new metrics to the changelog 3.4?
https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.4.md#metrics-monitoring

@WIZARD-CXY
Copy link
Contributor Author

@jingyih consider it done

@WIZARD-CXY WIZARD-CXY changed the title [WIP]etcdserver: add learner metrics etcdserver: add learner metrics Jun 8, 2019
@jingyih jingyih merged commit 48d144a into etcd-io:master Jun 9, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants