Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcdserver: Add response byte size and range response count to took too long warning #9826

Merged
merged 1 commit into from
Jun 11, 2018

Conversation

jpbetz
Copy link
Contributor

@jpbetz jpbetz commented Jun 8, 2018

Large range reads can result in "took too long" warnings. This adds range response counts as well as total proto response message size (in bytes) to the warnings to help debug situations where knowing the volume of data returned can help diagnose an issue.

put:

BEFORE:
2018-06-08 10:11:40.140673 W | etcdserver: read-only range request "key:\"/a\" range_end:\"/b\" " took too long (97.966µs) to execute

AFTER:
2018-06-08 10:11:40.140673 W | etcdserver: read-only range request "key:\"/a\" range_end:\"/b\" " with result "range_response_count:3 size:96" took too long (97.966µs) to execute

txn:

BEFORE:
etcdserver: read-only range request "compare:<target:VALUE key:\"a\" value:\"1\" > success:<request_range:<key:\"/a\" range_end:\"/b\" > > failure:<request_range:<key:\"/a\" range_end:\"/b\" > > " took too long (68.588µs) to execute

AFTER:
etcdserver: read-only range request "compare:<target:VALUE key:\"a\" value:\"1\" > success:<request_range:<key:\"/a\" range_end:\"/b\" > > failure:<request_range:<key:\"/a\" range_end:\"/b\" > > " with result "responses:<<range_response_count:3>> size:104" took too long (68.588µs) to execute

cc @gyuho @wenjiaswe @mml

@@ -114,7 +114,7 @@ func (a *applierV2store) Sync(r *RequestV2) Response {
// applyV2Request interprets r as a call to v2store.X
// and returns a Response interpreted from v2store.Event
func (s *EtcdServer) applyV2Request(r *RequestV2) Response {
defer warnOfExpensiveRequest(s.getLogger(), time.Now(), r)
defer warnOfExpensiveRequest(s.getLogger(), time.Now(), r, nil, nil)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is v3 only acceptable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*RequestV2 also satisfies proto.Message interface, but I don't mind just passing nil for v2 API (since this is a new feature, no need to worry about v2 behavior).

Copy link
Contributor

@gyuho gyuho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm thanks!

@gyuho gyuho added the WIP label Jun 8, 2018
@gyuho
Copy link
Contributor

gyuho commented Jun 8, 2018

Marking as WIP until we figure out CI failures.

@jpbetz
Copy link
Contributor Author

jpbetz commented Jun 8, 2018

Reviewing a problem with proto.Size from integration tests:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xa90726]

github.com/coreos/etcd/etcdserver/etcdserverpb.(*PutResponse).Size(0x0, 0x4cac5b)
        /usr/local/google/home/jpbetz/Projects/etcd/src/github.com/coreos/etcd/etcdserver/etcdserverpb/rpc.pb.go:8148 +0x26
github.com/coreos/etcd/etcdserver/etcdserverpb.(*PutResponse).Marshal(0x0, 0x176af60, 0x0, 0x7fbfb4a53af0, 0x0, 0x1)
        /usr/local/google/home/jpbetz/Projects/etcd/src/github.com/coreos/etcd/etcdserver/etcdserverpb/rpc.pb.go:5161 +0x2f
github.com/coreos/etcd/vendor/github.com/golang/protobuf/proto.Size(0x176af60, 0x0, 0xc420117880)
        /usr/local/google/home/jpbetz/Projects/etcd/src/github.com/coreos/etcd/vendor/github.com/golang/protobuf/proto/table_marshal.go:2602 +0x100
github.com/coreos/etcd/etcdserver.warnOfExpensiveRequest(0xc428750060, 0xbebecd857a50c3e6, 0xdde501b3a, 0x17c4620, 0x17604e0, 0xc4203a88c0, 0x176af60, 0x0, 0x175f920, 0xc42014db10)

https://github.com/coreos/etcd/blob/c6e18c712c7cc336a86404478297afa6063a72eb/etcdserver/etcdserverpb/rpc.pb.go#L8148

https://github.com/coreos/etcd/blob/c6e18c712c7cc336a86404478297afa6063a72eb/etcdserver/etcdserverpb/rpc.pb.go#L5161

It appears that somehow Marshal is called on a nil pointer here:

https://github.com/golang/protobuf/blob/b4deda0973fb4c70b50d226b1af49f3da59f5265/proto/table_marshal.go#L2602

@gyuho
Copy link
Contributor

gyuho commented Jun 8, 2018

@jpbetz
Copy link
Contributor Author

jpbetz commented Jun 8, 2018

I've put a workaround for the proto.Size panic. I'm not thrilled with it, but it still gives us more data than we had before.

Integration tests are now passing. I'll wait for CI.

if r := recover(); r != nil {
size = "proto_size_failure"
}
}()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this work?

import "github.com/golang/protobuf/proto"

ms, ok := msg.(proto.Marshaler)
if ms == nil || !ok {
  return "proto_size_failure"
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately no, I just tried it and got the same error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Then, let's use recover for now, and revert it back to upstream proto.Size later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Too bad we can't have a clean PR for this improvement. I'll track the upstream issue and propose a PR if needed.

@jpbetz
Copy link
Contributor Author

jpbetz commented Jun 8, 2018

Just waiting on a clean semaphoreci run now.

@gyuho gyuho removed the WIP label Jun 8, 2018
@gyuho
Copy link
Contributor

gyuho commented Jun 8, 2018

@jpbetz I noticed that proto.Message can be non-nil even if the original typed object was nil.

In TestCtlV3Compact, we are passing *etcdserverpb.CompactionResponse nil object as proto.Message interface, but once it's converted to proto.Message, the interface proto.Message itself is not nil since the original object still points to something.

 // safeSize attempt to calculate size using proto.Size, but recovers from panics and returns "proto_size_failure" in those cases.
 // TODO: Remove once https://github.com/golang/protobuf/issues/631 is resolved.
 func safeSize(msg proto.Message) (size string) {
+       fmt.Println("safeSize:", msg == nil, reflect.ValueOf(msg).IsNil())

This will print out safeSize: false true.

So, it will panic in proto.Marshaler.Marshal.

func (m *CompactionResponse) Marshal() (dAtA []byte, err error) {
	size := m.Size()

@jpbetz
Copy link
Contributor Author

jpbetz commented Jun 8, 2018

@gyuho Thanks for hunting this down. I'd been warned before of this gotcha with interfaces! Fixing now.

@wenjiaswe
Copy link
Contributor

Very useful information. LGTM.

@jpbetz jpbetz force-pushed the response_sizes branch 2 times, most recently from adf989f to df47105 Compare June 9, 2018 04:46
@jpbetz
Copy link
Contributor Author

jpbetz commented Jun 9, 2018

proto.Size calls were indeed fixed by checking for non-nil proto.Message interfaces with nil value pointers per @gyuho's suggestion. Also fixed a mistake in how the error handling code in Range was rearranged to instrument measuring response sizes. Once tests are green this is ready for merge.

func warnOfExpensiveRequest(lg *zap.Logger, now time.Time, reqStringer fmt.Stringer, respMsg proto.Message, err error) {
var resp string
if !isNil(respMsg) {
resp = fmt.Sprintf("size:%s", proto.Size(respMsg))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

%d?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thanks. The safeSize function I used before returned a string. Fixing.

func warnOfExpensiveReadOnlyRangeRequest(lg *zap.Logger, now time.Time, reqStringer fmt.Stringer, rangeResponse *pb.RangeResponse, err error) {
var resp string
if !isNil(rangeResponse) {
resp = fmt.Sprintf("range_response_count:%d size:%s", len(rangeResponse.Kvs), proto.Size(rangeResponse))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"range_response_count:%d size:%d"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

// only range responses should be in a read only txn request
}
}
resp = fmt.Sprintf("responses:<%s> size:%s", strings.Join(resps, " "), proto.Size(txnResponse))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"responses:<%s> size:%d"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Contributor

@gyuho gyuho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants