Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document generic approach for span status (code + description) and exception event when instrumented code throws #1536

Closed
lmolkova opened this issue Oct 31, 2024 · 10 comments · Fixed by #1716

Comments

@lmolkova
Copy link
Contributor

lmolkova commented Oct 31, 2024

The common approach seems to be:

  • do nothing if the exception is handled by the instrumented library (retries, etc)
  • for unhandled (by client lib) exceptions:
    • set span status to error
    • set span status description to exception message
    • set error.type attribute on spans/metrics based on the exception type or more specific low-cardinality error code
    • DO NOT record exception event (by default) unless recording local root span or when the instrumentation knows that user code didn't handle the exception. Reason: exceptions are huge and expensive. Users decide if/how to record them when they catch them. If exception stays unhandled, the server instrumentation will record it.

We should document it and link from DB/messaging/gen-ai and other conventions.

@alanwest
Copy link
Member

Another aspect of this to consider is whether information should be captured from the outermost exception or innermost in the case of nested exceptions.

In capturing error.type and span status description, my intuition is that the innermost exception type/message is the most useful because it most closely describes the root of the problem.

On the other hand, when a user opts in to recording exception events, my intuition is that the outermost exception may be the most useful because the stack trace captured usually contains information from all nested exceptions.

Interestingly, the DB semantic conventions seems to suggest capturing details from the innermost exception:

[9]: The error.type SHOULD match the db.response.status_code returned by the database or the client library, or the canonical name of exception that occurred. When using canonical exception type name, instrumentation SHOULD do the best effort to report the most relevant type. For example, if the original exception is wrapped into a generic one, the original exception SHOULD be preferred. Instrumentations SHOULD document how error.type is populated.

The HTTP semantic conventions seem less opinionated and does not have a similar statement.

@trask
Copy link
Member

trask commented Nov 5, 2024

  • DO NOT record exception event (by default) unless recording SERVER/CONSUMER span when the instrumentation knows that user code didn't handle the exception.

a slight variation on this is to replace SERVER/CONSUMER span above with "local root" span

@trask
Copy link
Member

trask commented Nov 6, 2024

since it appears that github discussions don't get backreferenced, linking here: open-telemetry/opentelemetry-java-instrumentation#12125

@cheempz
Copy link

cheempz commented Nov 19, 2024

Tacking on a related question: in the issue description it seems setting span status to error and setting span error.type attribute should both happen when an operation errors. I'm wondering if this is implicit in the HTTP span semconv (that span status of error must also set error.type attribute, and vice versa), or if it's purposely left open for instrumentation libraries / end users to decide whether one or both of these are set. Thanks!

@trask
Copy link
Member

trask commented Nov 19, 2024

hi @cheempz, check out this part of https://github.com/open-telemetry/semantic-conventions/blob/main/docs/http/http-spans.md:

If the request fails with an error before response status code was sent or received, error.type SHOULD be set to exception type (its fully-qualified class name, if applicable) or a component-specific low cardinality error identifier.

If response status code was sent or received and status indicates an error according to HTTP span status definition, error.type SHOULD be set to the status code number (represented as a string), an exception type (if thrown) or a component-specific error identifier.

@cheempz
Copy link

cheempz commented Nov 19, 2024

Thanks @trask! If I'm reading correctly it means span status of error should also set error.type, does it also mean conversely if error.type is set the span status should also be set to error? What we're trying to figure out is when analyzing the trace, can we simply look at span status error as the indicator that an operation had an error, then use span error.type attribute and span exception event as additional information about the error... or is it that any of these three (span status, span attribute, span event) could occur independently and indicate an error. I know it's ultimately in the end user's hands if they want to set/override any of these, just wondering if there is semconv guidelines around the intended use of these elements.

hi @cheempz, check out this part of https://github.com/open-telemetry/semantic-conventions/blob/main/docs/http/http-spans.md:

If the request fails with an error before response status code was sent or received, error.type SHOULD be set to exception type (its fully-qualified class name, if applicable) or a component-specific low cardinality error identifier.
If response status code was sent or received and status indicates an error according to HTTP span status definition, error.type SHOULD be set to the status code number (represented as a string), an exception type (if thrown) or a component-specific error identifier.

@trask
Copy link
Member

trask commented Nov 19, 2024

can we simply look at span status error as the indicator that an operation had an error

👍 (I hope this is answered here: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#set-status)

@lmolkova
Copy link
Contributor Author

lmolkova commented Nov 19, 2024

Related: #1560 (comment)

Quoting @trask here for the context

Feedback from the spec SIG was that this sounds like a good default behavior, but we should make it configurable.

Some options for making it configurable:

(A) handle this all in the SDK:

but I don't think we can change the default behavior of the SDK.

(B) Another option could be to introduce "trace advice" (similar to metric advice) that would allow a given instrumentation the ability to opt-in to this new behavior.

(C) Another option could be:

  • new configuration added to TracerConfig
  • instead of changing the (existing) behavior of RecordException, ask instrumentations to check this configuration value before calling RecordException

(D) Another option could be:

(E) Or even:

  • Introduce a global instrumentation configuration property that covers all instrumentations
  • ask instrumentations to check this configuration value before calling RecordException
  • we may also want to be able to override this at individual instrumentation level somehow

@lmolkova
Copy link
Contributor Author

lmolkova commented Nov 19, 2024

Ultimately I don't think that we should record exceptions on spans - we should use log-based events for it since
people may want to record errors when spans are sampled out or when tracing is disabled. Plus logs have severity and other future benefits such as body which could accommodate structured stack traces and exception chains.

If we think about logs there are at least two options that make sense:

  • whoever threw the exception, logs that exception - this prevents everyone who rethrows (wraps/aggregates) it from re-logging the stacktrace over and over again. In this case we log every exception once and it has the right context. But we don't yet know if it will be handled and might end up logging a lot of noise.
  • the catch-all handler in the application logs it - this would allow to record only unhandled exceptions, but they would (at best) have root span as the context.

There are options in between (do both, or log every time exception is created or suppressed by another one).

So I think we need some sort of an exception logging strategy enum and it should be on the LoggerConfig.

@lmolkova
Copy link
Contributor Author

lmolkova commented Nov 19, 2024

One more thing to consider: almost every library is instrumented natively with logs and records exceptions already. If we have OTel integration for the underlying logger, there is no benefit in re-recording lib exceptions in OTel instrumentations.

We should still provide guidance for green-field native instrumentations on when to log exceptions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment