Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Context to stop tracing #530

Open
pauldraper opened this issue Mar 28, 2020 · 34 comments
Open

Use Context to stop tracing #530

pauldraper opened this issue Mar 28, 2020 · 34 comments
Labels
area:api Cross language API specification issue maintainer-request Escalated by SIG maintainers spec:context Related to the specification/context directory spec:trace Related to the specification/trace directory triage:accepted:needs-sponsor Ready to be implemented, but does not yet have a specification sponsor

Comments

@pauldraper
Copy link

pauldraper commented Mar 28, 2020

There are circumstances for instrumented actions where:

  1. The action is frequent and of low interest: a healthcheck, polling a message queue, etc.
  2. An OpenTracing exporter uses libraries that themselves may be instrumented (risking infinite tracing).
  3. If the current layer (e.g. RPC) happens to offer sufficiently detailed tracing, lower HTTP/DNS/TCP/UDP layers do not need to be traced when invoked with this library. This need is heighten by the use of server/client spans, which AFAIK needs to remove spans in between them. (SpanKind with layers #526).

(1) has come up in #173

(2) has been a problem in opentelemetry-js (open-telemetry/opentelemetry-js#332) HTTP-based exporters, since the Node.js stdlib is instrumented globally. The solution in that has was adding a special HTTP header x-opentelemetry-outgoing-request headers, that the http instrumentation ignores. In essence it uses HTTP headers as a poor-man's context API. (This may not scale to other APIs.)


It would be nice to cut off all automatic tracing "below this current scope" by setting a context key. The default tracer implementation would create no-op spans if the context disables spans.

EDIT: Case (1) may call for sampling, rather than full disabling.

@dyladan
Copy link
Member

dyladan commented Mar 28, 2020

OpenTelemetry Ruby does this and I really like the idea

@Oberon00
Copy link
Member

See the Python solution, which I think is rather elegant: open-telemetry/opentelemetry-python#181 (updated in open-telemetry/opentelemetry-python#395 for OTEP 66).

You introduce a conventional context variable "suppres_instrumentation" that all instrumentations check and the span processors set to true while exporting.

@dyladan
Copy link
Member

dyladan commented Mar 30, 2020

Why do the instrumentations check? Would you not just have the tracer check the context and return a non-reporting span?

@Oberon00
Copy link
Member

Oberon00 commented Mar 30, 2020

That would also work, but personally I prefer the more explicit no-magic approach here. Also, the instrumentation overhead can be further reduced if the instrumentation checks manually and shortcuts, instead of doing all the usual instrumentation with a no-op Span. Checking Span.IsRecording would probably work in most cases though.

@dyladan
Copy link
Member

dyladan commented Mar 30, 2020

I think that is the primary use for the isRecording flag. I just prefer to only have to implement such things in a single place so that there is less risk of forgetting. I'm playing around with such an implementation in JS right now and I set the context in the span processor, then return non-recording spans from the tracer if that context is set. It works well and required very few changes.

@Oberon00
Copy link
Member

@toumorokoshi @codeboten FYI

@pauldraper
Copy link
Author

pauldraper commented Mar 30, 2020

I agree with @dyladan .

(A) The tracer is already managing context (activeSpan, withActiveSpan).

(B) Fewer places is better.

@Oberon00
Copy link
Member

@pauldraper The tracer will stop "managing" context but it will still need to access the context when creating a new span, see #527

@toumorokoshi
Copy link
Member

Hi, another opentelemetry-python contributor chiming in, thanks @Oberon00!

Why do the instrumentations check? Would you not just have the tracer check the context and return a non-reporting span?

I think have the tracer check the flag is a good idea. I don't think it's too much magic, and comes at the benefit of not having to require instrumentation authors to implement this pattern. We can control this behavior on the sdk level, which enables stronger guarantees on behavior consistency.

I think that is the primary use for the isRecording flag.

It doesn't look like that's the intention of IsRecording, at least from what I see in the spec:

Returns true if this Span is recording information like events with the AddEvent operation, attributes using SetAttributes, status with SetStatus, etc. There should be no parameter.

This flag SHOULD be used to avoid expensive computations of a Span attributes or
events in case when a Span is definitely not recorded. Note that any child
span's recording is determined independently from the value of this flag
(typically based on the sampled flag of a TraceFlag on
SpanContext).

I haven't seen examples of these expensive computations yet, but it seems unrelated to whether the span itself should be exported. I would argue for a separate context variable (suppressInstrumentation to borrow our current python convention).

The additional pro/con of a suppressInstrumentation flag separate from the Span itself is the fact that both metrics and spans can share that value. But maybe, it would be better if we had those as two separate flags as most likely you will only want to suppress one or the other, most likely in the exporter.

So maybe:

suppressTraceInstrumentation and suppressMetricInstrumentation?

@Oberon00
Copy link
Member

Oberon00 commented Apr 1, 2020

I haven't seen examples of these expensive computations yet

You have to look at everything the instrumentation does for each span. E.g. the sum of all the things that the Python WSGI instrumentation does per request is relatively expensive.

@dyladan
Copy link
Member

dyladan commented Apr 1, 2020

@toumorokoshi I was saying the primary use of isRecording is to avoid expensive operations to gather span attributes. @Oberon00 had recommended checking the context for instrumentation disabled flag within instrumentations because it would also avoid said expensive computations.

In my opinion the "correct" implementation on the instrumentation side would be:

  1. instrumentation calls startSpan
  2. tracer sees instrumentation disabled, returns a non-recording span
  3. instrumentation sees span is non-recording and avoids expensive operations

This avoids the need for every instrumentation to check this flag. Because there will be many instrumentations, we cannot guarantee none of them will forget.

The flag is set on the context by span processors before they call exporters, or optionally in the exporter. I have no strong opinion on this difference.

If an instrumentation wants to disable the instrumentation of the lower-level protocols, it may also set the flag. e.g. some database driver which uses http to talk to the database may want to disable the http instrumentation in favor of instrumenting the protocol they've built on top of it.

If starting the span itself requires one of these expensive operations, then nothing is preventing the instrumentations from also checking the context flag.

@toumorokoshi
Copy link
Member

toumorokoshi commented Apr 1, 2020

I haven't seen examples of these expensive computations yet

You have to look at everything the instrumentation does for each span. E.g. the sum of all the things that the Python WSGI instrumentation does per request is relatively expensive.

I think I understand. You're referring to adding a check, per integration, to see if the IsRecording flag is set. if it's false, then we avoid adding attributes altogether, which would skip a non-trivial amount of code to collect and add attributes:

https://github.com/open-telemetry/opentelemetry-python/blob/master/ext/opentelemetry-ext-wsgi/src/opentelemetry/ext/wsgi/__init__.py#L121

I think if the spec called out explicitly that the resulting span would actually be missing attributes that would have been added in an IsRecording scenario, that would have been a little clearer to me. Thank you for clarifying.

In my opinion the "correct" implementation on the instrumentation side would be:

Make sense. As a clarification, is it just a non-recording span? It's a non-recording, no-op span that ensures that the span will never reach the exporter in the first place, correct?

Although it makes sense to me intuitively, I don't see a call-out for a no-op or default span object that would not emit spans to spanprocessors in the specification. So maybe that's something that should be added as well.

@dyladan
Copy link
Member

dyladan commented Apr 2, 2020

I think I understand. You're referring to adding a check, per integration, to see if the IsRecording flag is set. if it's false, then we avoid adding attributes altogether, which would skip a non-trivial amount of code to collect and add attributes

yes

I think if the spec called out explicitly that the resulting span would actually be missing attributes that would have been added in an IsRecording scenario, that would have been a little clearer to me

A non-recording span doesn't record any attributes at all

As a clarification, is it just a non-recording span? It's a non-recording, no-op span that ensures that the span will never reach the exporter in the first place, correct?

A no-op span is one step further than a non-recording span. A non-recording span still propagates context, just doesn't record anything to the backend. It is useful in cases where you do not want to break traces. A no-op span can't be propagated, because it has no trace context or trace state. The non-recording span isn't sent to the backend, but propagates trace context correctly so that traces aren't broken. In this instance, a no-op span would be equally effective and possibly an even better choice.

@Oberon00
Copy link
Member

Oberon00 commented Apr 2, 2020

@toumorokoshi Your link is actually perfect to illustrate why we additionally could use the "surpress_instrumentation" context variable: In https://github.com/open-telemetry/opentelemetry-python/blob/v0.6.0/ext/opentelemetry-ext-wsgi/src/opentelemetry/ext/wsgi/__init__.py#L121, attributes are collected to be passed to start_span (so the sampler has more info and knows e.g. the URL), so the is_recording property could only be queried after.

@toumorokoshi
Copy link
Member

@toumorokoshi Your link is actually perfect to illustrate why we additionally could use the "surpress_instrumentation" context variable: In https://github.com/open-telemetry/opentelemetry-python/blob/v0.6.0/ext/opentelemetry-ext-wsgi/src/opentelemetry/ext/wsgi/__init__.py#L121, attributes are collected to be passed to start_span (so the sampler has more info and knows e.g. the URL), so the is_recording property could only be queried after.

Ah! That's a very good point. I guess between the two, I would probably op to just use the context variable, which can then be wrapped in a utiltity method on the trace module itself like trace.instrumentationEnabled(). And I do agree that, in that case, that may make isRecording redundant.

@toumorokoshi
Copy link
Member

@pauldraper will a PR be filed on this? I like this approach and would love to see it in the spec so I can implement it.

@Oberon00
Copy link
Member

@trask I wonder, how does auto-instr-java deal with that? It has a gRPC instrumentation, would that not cause a feedback loop with the OTLP exporter currently?

@trask
Copy link
Member

trask commented May 25, 2020

In auto-instr-java, we load the exporter (and it's dependencies) inside of a separate class loader. This is primarily done to avoid version conflicts, but has the nice the side-effect that we can skip instrumentation of those classes (by looking at what classloader they're in) and avoid the feedback loop.

@trask
Copy link
Member

trask commented May 25, 2020

One interesting caveat to the above: open-telemetry/opentelemetry-java-instrumentation#375 (comment)

@anuraaga
Copy link
Contributor

For a reference point, I've run into 3) myself. I find spans from e.g., the netty instrumentation noise when they're being wrapped by a client library, for example the aws sdk. Others may not though meaning some configurability would be nice. But it creates the problem that

  1. With auto instrumentation, it wouldn't really be possible to disable netty instrumentation only when called from an RPC SDK, have to disable it globally

  2. Even when disabled, it causes issues where semantics depend on whether there is an underlying instrumentation or not like in Change aws-sdk span kind to INTERNAL opentelemetry-java-instrumentation#323

Being able to pause new spans for a stack frame would help with this.

@sirzooro
Copy link

This new instrumentationDisabled flag should be also propagated from process to process. Servers/Consumers could use it to automatically disable instrumentation for their spans.

My use case: multiple services uses common Redis library with added instrumentation. Service A periodically sends KeepAlive messages to other services, and I want to disable tracing for them on both Publisher and Subscriber sides. On Publisher side this is quite easy, I can tell Redis library skip instrumentation. However on Subscriber side this is not so easy, Redis library would have to examine messages to check if this is KeepAlive or not. One potential solution to it is to not send TraceParent/State from Publisher for KeepAlives, and create new span in Subscriber only if they are received. However this has a downside that spans are not created for messages sent from services which are not instrumented yet. Because of this it is better to explicitly send instrumentationDisabled flag.

@dyladan
Copy link
Member

dyladan commented May 27, 2022

@dyladan was there a decision around what the right way to move forward for suppressing instrumentations? I see the proposed solution of making a key available through the API was closed.

Nope. The PR was closed because there wasn't agreement and as far as I know nobody ever followed up. I stopped working on it because I have to admit I was feeling a little defeated.

This new instrumentationDisabled flag should be also propagated from process to process. Servers/Consumers could use it to automatically disable instrumentation for their spans.

This introduces serious trust concerns. Without somehow signing the request to ensure that it is trusted, this would be ripe for abuse. I would argue that paying to cost of inspecting the message to see that it is a keepAlive on the server side is a much simpler solution less prone to errors and abuse.

@Oberon00
Copy link
Member

@sirzooro If you need propagation of the flag, you are probably looking for the existing sampled flag, not this feature.

@sirzooro
Copy link

@sirzooro If you need propagation of the flag, you are probably looking for the existing sampled flag, not this feature.

Thanks, this is exactly what I was looking for.

@dyladan
Copy link
Member

dyladan commented Dec 13, 2022

The need for a mechanism like this was brought up again by @tsloughter in the spec call again today. At this point most if not all SIGs have some mechanism like the context solution that my PR proposed. At least JS, Ruby, Python, and .NET have all implemented something similar. Not sure which of those implemented in the API vs SDK, but there are definitely instrumentations in the wild using it in multiple languages.

In JS it was implemented as a context key which the SDK recognizes and provides a helper function to set and unset. IMO this is a poor situation because it tightly couples the instrumentation with the SDK. I would prefer to implement it in the API but did not want to do that without specification.

The proposal was a simple key which suppressed tracing by indicating a nonrecording span should be returned. This prevents exporter loops and allows higher level module instrumentations like axios to suppress lower level ones like http. Other signals were not needed to be suppressed because they don't have these problems (incrementing a metric on export does not cause another export like in tracing, and axios/http would have separate metric streams).

I would like to propose that we move forward with that original proposal. It solves the problems which need to be solved, is already implemented in many places, and does not preclude a more advanced per-signal, or any other, mechanism from being specified in the future. I think this is one place where we have let the desire for perfection get in the way of pragmatism.

@tsloughter
Copy link
Member

@dyladan by "original proposal" do you mean an existing PR? I was going to make one today and planned it to be a function like suppress_telemetry() which takes optional arguments to only suppress certain signals.

@dyladan
Copy link
Member

dyladan commented Dec 14, 2022

@tsloughter I meant this #1653

@alanwest
Copy link
Member

Other signals were not needed to be suppressed because they don't have these problems (incrementing a metric on export does not cause another export like in tracing, and axios/http would have separate metric streams).

Catching up a bit on the old conversation from #1653, it looks like the original proposal was to suppress all signals with this feature. Personally, I think that's a good thing. It is what .NET's implementation currently does.

It can be necessary to suppress other signals as well. Other signals can also generate undesirable loops in a sense. An http client metric incremented on export will be exported on the next cycle - which would occur infinitely. Same kind of thing with logs. Tangentially, it can complicate things for us if/when we want to generate telemetry about the SDK itself, but I think this is a different problem to solve.

@samarth-math
Copy link

samarth-math commented Dec 7, 2023

Is there a clean way to have a feature flag which stops exporting the trace to the backend?

(I'm using gcp and opentelemetry with python, and want to disable tracing while developing locally, but enable in production)

@dyladan dyladan added triage:accepted:needs-sponsor Ready to be implemented, but does not yet have a specification sponsor and removed release:after-ga Not required before GA release, and not going to work on before GA labels Mar 29, 2024
@austinlparker austinlparker moved this to Spec Icebox in 🔭 Main Backlog Apr 16, 2024
@austinlparker austinlparker moved this from Spec - Accepted to Spec - Priority Backlog in 🔭 Main Backlog Apr 16, 2024
@mackinra
Copy link

mackinra commented Jul 5, 2024

OpenTelemetry Ruby does this and I really like the idea

@dyladan I have a need to suppress some noisy traces (that effectively poll a message queue) in our app. Can you elaborate on how this can be done in Ruby?

@marcalff
Copy link
Member

An OpenTracing exporter uses libraries that themselves may be instrumented (risking infinite tracing).

This is relevant for opentelemetry-cpp.

The OTLP GRPC exporter uses grpc, and the grpc C++ library can have instrumentation that uses opentelemetry-cpp.

cc @open-telemetry/cpp-maintainers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:api Cross language API specification issue maintainer-request Escalated by SIG maintainers spec:context Related to the specification/context directory spec:trace Related to the specification/trace directory triage:accepted:needs-sponsor Ready to be implemented, but does not yet have a specification sponsor
Projects
Status: Spec - Priority Backlog