Add initial Tracing with OpenTelemetry #12919
Conversation
@ptabor @dashpole @logicalhan @hexfusion please have a look, thank you!
@lilic just to be sure: zero cost if experimental-enable-tracing is false (which is the default), right?
Thanks, @lilic. Overall I think this is a nice feature to try and move forward into 3.5. The apiserver[1] is working on this as well, so the pairing will be an important part of the observability story for k8s. While the performance hit is unfortunate, documenting it as such and gating it under an experimental flag seems appropriate. My understanding is that there are continuing efforts to improve performance. +1
server/embed/config.go
Outdated
ExperimentalTracingAddress string `json:"experimental-tracing-address"`
// ExperimentalTracingServiceName is the name of the service.
// Can only be used if ExperimentalEnabledTracing is true.
ExperimentalTracingServiceName string `json:"experimental-tracing-service-name"`
Would it make sense to use the etcd name and reduce the flag footprint?
Having the ability to specify the name allows you to collect the traces from all the running instances of etcd separately, so you can detect issues per running instance of etcd, this way you can in your UI select just the instance you had an issue with and troubleshoot.
Having the ability to specify the name allows you to collect the traces from all the running instances of etcd separately, so you can detect issues per running instance of etcd, this way you can in your UI select just the instance you had an issue with and troubleshoot.
Can we have this as the flag description or some form of doc? :)
e.g.,
// ExperimentalTracingServiceName is the tracing service name for
// OpenTelemetry (if enabled) -- "etcd" is the default service name.
// When shared, all telemetry data are aggregated under the same namespace.
// Use different names in order to collect data per each node.
(if I understand your comment correctly)
That is a good idea, will add it there!
The plan was to also add some docs if the PR is merged, maybe similar to where the metrics docs are right now.
Having the ability to specify the name allows you to collect the traces from all the running instances of etcd separately, so you can detect issues per running instance of etcd, this way you can in your UI select just the instance you had an issue with and troubleshoot.
If I understand what you're saying, this name should be unique per etcd? This is why I was wondering if using the etcd member name[1], which is unique within the cluster, would make sense, maybe as a default.
[1]https://github.com/etcd-io/etcd/blob/master/server/config/config.go#L37
Oh I like that idea.
Yes, it should be unique per etcd process/member.
Missed this comment, sorry. Nice, I will use that as the default! But I think we should still let users override it.
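For concreteness, a rough sketch of that defaulting, using the ExperimentalTracingServiceName field from this diff and the existing member Name field on the embed config (the helper name is made up, and the actual wiring in the PR may differ):
// tracingServiceName falls back to the etcd member name when the
// --experimental-tracing-service-name flag is left unset, while still
// letting an explicit flag value override it.
func tracingServiceName(cfg *embed.Config) string {
	if cfg.ExperimentalTracingServiceName != "" {
		return cfg.ExperimentalTracingServiceName
	}
	return cfg.Name // the etcd member name
}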
On second thought, if the name is not passed it defaults to... "default". IMHO this makes the discoverability out of the box in the tracing UI a bit harder, when the service is called "default". Should we stick to Name as the default or the constant "etcd"?
How about: if the name is "default", use "etcd"; otherwise use the name and allow override with the flag?
Exactly, we only ever enable OTel tracing if that flag is passed, yes. You can see this more clearly here: https://share.polarsignals.com/442c0d2/ I ran some load when the
Looks great overall. Requesting some minor changes. Thanks!
Codecov Report
@@            Coverage Diff             @@
##           master   #12919      +/-   ##
==========================================
- Coverage   73.12%   72.77%    -0.36%
==========================================
  Files         430      430
  Lines       34185    34238       +53
==========================================
- Hits        24998    24915       -83
- Misses       7252     7376      +124
- Partials     1935     1947       +12
@lilic On second thought, tracing is not descriptive enough, given that we already have tracing within the etcd process. I think we should name it something like
I think that is a good idea, yes, for clarity! So we would have everything prefixed with
I noticed the help text file (server/etcdmain/help.go); is there a way to generate that file, or is it manually edited?
lgtm :)
We manually edit it :)
lgtm, thanks!
server/embed/etcd.go
Outdated
tracesdk.WithSyncer(exporter),
tracesdk.WithResource(res),
)
otel.SetTracerProvider(tp)
Just to call this out to reviewers less familiar with opentelemetry: This is setting a global variable, which is used by tracing libraries. In the kubernetes APIServer, we will likely avoid doing this by explicitly passing the TracerProvider to the libraries we want to do tracing. It is more code, but avoids global variables. Not saying you should do it one way or another, but wanted to point it out.
Thanks! Agreed on globals! 💯 Right now we are not creating any individual spans, only using the gRPC interceptors, so I don't see it being different, but I could be wrong?
If, for example, a library you rely on added OpenTelemetry tracing, you would start getting traces from that library without changing anything on your end, since it will look up the global TracerProvider to see where to send traces. That may or may not be desirable. Also, if you have multiple different "services" in the same go binary, you can't customize the TracerProviders individually for each service while using the global TracerProvider. I think it is perfectly fine to start with using the global one, though.
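For illustration only (not code from this PR), this is roughly how an instrumented dependency would pick up the globally registered provider; the package and span names here are made up:
import (
	"context"

	"go.opentelemetry.io/otel"
)

// doWork obtains its tracer through the global TracerProvider set via
// otel.SetTracerProvider, so its spans go to whatever exporter etcd configured.
func doWork(ctx context.Context) {
	tracer := otel.Tracer("some-dependency")
	ctx, span := tracer.Start(ctx, "doWork")
	defer span.End()
	_ = ctx // ctx would be passed to downstream calls so child spans link up
}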
The embed server is being used 'as a library' by our customers. Also, in tests we create multiple instances of the embed server to mimic multiple etcd nodes. Recently we got rid of 'global loggers' (including gRPC loggers) being overridden by each of the embed servers (which led to misleading zap fields being reported).
I don't know how differently OTel can be configured for different libs, but having a lib-specific tracer seems to be on the safer side.
I am all for removing globals; I have felt the pain many times with the Prometheus global registry (which I would love to remove from etcd one day as well, if folks agree).
From what I can tell we can pass it via https://pkg.go.dev/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc#WithTracerProvider, so I will do that.
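A minimal sketch of that, assuming the otelgrpc contrib package linked above and its interceptor-style API; the etcd-specific interceptor chaining is omitted and the function name is made up:
import (
	"google.golang.org/grpc"

	"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
	tracesdk "go.opentelemetry.io/otel/sdk/trace"
)

// newGRPCServer passes the TracerProvider to the interceptors explicitly,
// so no global state is set or read.
func newGRPCServer(tp *tracesdk.TracerProvider) *grpc.Server {
	return grpc.NewServer(
		grpc.UnaryInterceptor(otelgrpc.UnaryServerInterceptor(otelgrpc.WithTracerProvider(tp))),
		grpc.StreamInterceptor(otelgrpc.StreamServerInterceptor(otelgrpc.WithTracerProvider(tp))),
	)
}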
Side note here: @dashpole I would love to see the default examples have all the knowledge you shared here, as all the various tutorials and examples out there for OTel use the global registry and essentially do what I did initially. I fear this will cause some of the same issues as Prometheus has with the global registry (think etcd + apiserver). What do you think?
Embed server is being used 'as a library' by our customers. Also in tests we create multiple instances of Embed server to mimic multiple etcd nodes.
If the embedded server is used as a library it may be better to avoid use of the OTel SDK at all and only use the API. In that case the server should receive its TracerProvider and propagators using the API interfaces from the main package that has instantiated them with the SDK. That would allow the consumer of the embedded server to decide whether to enable tracing, which exporters and propagators to use, even which SDK to use if alternate implementations of the SDK become available.
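A hypothetical sketch of that shape (these fields do not exist in this PR): the embedded server would depend only on the OTel API interfaces, and the embedding application would construct and own the SDK.
import (
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/trace"
)

// Config is a hypothetical embed-style config that accepts tracing
// dependencies via the OpenTelemetry API; exporters, sampling and shutdown
// stay in the hands of the embedding application.
type Config struct {
	// TracerProvider is used for all spans created by the embedded server.
	// A nil value (or trace.NewNoopTracerProvider()) disables tracing.
	TracerProvider trace.TracerProvider
	// Propagators controls cross-process trace context propagation.
	Propagators propagation.TextMapPropagator
}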
server/embed/etcd.go
Outdated
tracesdk.WithBatcher(exporter),
tracesdk.WithResource(res),
)
otel.SetTracerProvider(tp)
in addition to setting the global TracerProvider, you should also set the global propagators:
otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(propagation.TraceContext{}, propagation.Baggage{}))
The TraceContext propagator will allow linking traces from the APIServer to etcd, and Baggage allows passing additional tags via context.
One thing you don't do here, but probably should, is add a sampling policy. The default is ParentBased(AlwaysSample) (always sample unless there is an incoming parent context that is explicitly not sampled), which is quite verbose. Something like sampling 1% of requests is probably more appropriate. But that can easily be done in a follow-up.
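Putting those two suggestions together, a possible follow-up could look roughly like this (assuming a recent otel-go SDK; the 1% ratio and the function name are just examples):
import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
	tracesdk "go.opentelemetry.io/otel/sdk/trace"
)

// setupTracing registers global propagators and a parent-based 1% sampler.
func setupTracing(exporter tracesdk.SpanExporter) *tracesdk.TracerProvider {
	tp := tracesdk.NewTracerProvider(
		tracesdk.WithBatcher(exporter),
		// Sample 1% of new traces, but respect the decision of an incoming
		// parent span (e.g. a trace started by the kube-apiserver).
		tracesdk.WithSampler(tracesdk.ParentBased(tracesdk.TraceIDRatioBased(0.01))),
	)
	otel.SetTracerProvider(tp)
	// TraceContext links traces across services; Baggage carries extra tags.
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{},
		propagation.Baggage{},
	))
	return tp
}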
Yes, agreed. This was a minimal PR, and this was also one of my questions in the PR description: whether we can do follow-ups, even in patch releases, that would include new configuration.
The tests fail due to:
Should get fixed by:
Okay, this should be ready for another look:
Please have another look, thank you! TODO from my side:
)
// As Tracing service Instance ID must be unique, it should
// never use the empty default string value, so we only set it
// if it's a non-empty string.
So if the Instance ID is an empty string, is that an error?
As this field is optional in OTel I wouldn't return any errors, just skip it. We just don't add it, as that is our default value, but also because an empty string can never be unique and the ID must be unique. Happy to take another approach here as well, suggestions welcome!
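For concreteness, a sketch of that skip-if-empty behaviour (the function name is made up, and the semconv import path depends on the SDK version used):
import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/sdk/resource"
	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

// newTracingResource attaches service.instance.id only when a non-empty
// value was configured, since an empty ID could never be unique.
func newTracingResource(ctx context.Context, serviceName, instanceID string) (*resource.Resource, error) {
	attrs := []attribute.KeyValue{
		semconv.ServiceNameKey.String(serviceName),
	}
	if instanceID != "" {
		attrs = append(attrs, semconv.ServiceInstanceIDKey.String(instanceID))
	}
	return resource.New(ctx, resource.WithAttributes(attrs...))
}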
Yeah, I think deriving reasonable defaults from the current member would be a nice option. If the flag is not required, folks will leave it blank and then get errors later on. As noted above, the etcd name or peer URL could be considered.
This could be a follow-up.
The etcd name is not unique per etcd instance, at least not by default ("default" :) ). But happy to use peerURL.
There tend to be two approaches from what I have seen: either nothing by default, or the hostname, which is what some projects use as the default.
lgtm!
Thank you all, some new changes: I ran the above script, added the changelog entries and the help text for the flags. Please have another look.
lgtm. Let's follow up in separate PRs for other changes mentioned here.
This PR adds initial tracing using OpenTelemetry, as per the discussion in #12460. Let me know if we want this merged in 3.5; I believe it still adds value and observability, with some tradeoffs in resource usage that I detail below.
This introduces:
etcd --experimental-enable-tracing --experimental-tracing-address="0.0.0.0:55680" --experimental-tracing-service-name="etcd-new"
Only when experimental-enable-tracing is enabled does tracing of the gRPC requests start collecting. And the individual gRPC request span:
Performance/resource usage
I ran an initial small load with tracing on and found that overall on average it adds a 1.5% - 4% CPU overhead:
This is the merged profile during the total load test: https://share.polarsignals.com/d92bfce/. We can see under the icicle graph there that the otelgrpc.UnaryServerInterceptor function was 1.84% of the total CPU, plus an extra 0.84% of CPU time in otelgrpc.messageType.Event. In some individual profiles, we can see around 4.71% CPU overhead: https://share.polarsignals.com/e5d524c/
Happy to share more profiles, if needed, or run more load tests!
Open Questions
Follow-up improvements and features: