This repository has been archived by the owner on Dec 6, 2024. It is now read-only.

Span Status Improvements (without errors) #133

Closed
231 changes: 83 additions & 148 deletions text/0123-improve-span-status-api.md
@@ -4,7 +4,7 @@ Allow the Span Status API to represent more kinds of status

## Motivation

Right now OpenTelemetry Status is defined as an enumeration of gRPC status
codes. Although I couldn't find design criteria written down for this API
I fear it is too narrowly defined to be useful across the full breadth of
scenarios OpenTelemetry targets.
@@ -13,26 +13,26 @@ OpenTelemetry allows Spans to be created to represent any operation, including
those which don't involve communication with another component (Kind =
Internal). These underlying operations can have native status representations
from a particular domain or a language such as POSIX status codes, HRESULTs,
many variations of exceptions, error messages, HTTP status, or gRPC status.
HTTP status, or gRPC status.
However to capture the status as part of an OpenTelemetry span it must first be
mapped to something in OpenTelemetry's object model and this mapping has the
potential to create a few problems:

- **Inconsistency** - If the mapping from native representation to
OpenTelemetry representation isn't well-defined then API users or SDK
implementations are unlikely to choose the same mapping. This makes collected
data hard to work with because it can no longer be treated uniformly.
- **Loss of fidelity** - Mapping from a status representation that
distinguishes a large number of different results to one that only
distinguishes a few is inherently lossy. This prevents users from isolating
different status results they care about. It can also prevent UI from showing
useful status information because users don't relate to the reduced
representation and the transformation isn't reversible.
- **Conversion difficulty** - If the conversions are non-trivial then they
are unlikely to be implemented correctly or perhaps at all. Past feedback
suggests end users want to spend little to no effort on this task. SDK
implementers may be more diligent but are constrained to native status
representations that are known a priori.

These are challenges for any design of the status API, not solely the current
one. We will need to evaluate these issues as a matter of degree and make a
@@ -46,32 +46,32 @@ tasks I anticipate tool vendors and end-users would like to be possible with
OpenTelemetry status information:

1. **Viewing** - Developers diagnosing a specific distributed trace want to
understand the status of spans that occurred while it ran. To do this they want
to see status information annotated in the trace, ideally with progressive
levels of detail as the focus of investigation narrows. The viewed status
information should be easy to correlate back to the domain that generated it
and diagnostically useful.

2. **Searching/Filtering** - Developers suspect a particular status condition
might be occurring due to customer feedback, some behavior they observed
locally, in another trace, or perhaps code review. They want to search
collected telemetry to determine if and how often that status condition
occurs. If it occurs they want to explore example traces producing it to
better understand when it manifests and how it impacts their system. The
search terms should be intuitive given the developers' initial knowledge
about the status condition they are searching for and the domain it arose from.

3. **Grouping and metrics** - Developers monitoring a service want to
understand what kinds of status conditions are occurring and how frequently.
To do this they want the monitoring tool to provide UI that buckets results
into categories and shows the top categories ranked by count of occurrence.
They may also want metrics that track category counts over time to identify
trends and deviations from the trend. Many useful groupings are based on
sharing a common problem symptom or common root cause, but other more coarse
groupings may be useful in trend analysis. At the coarsest level, spans can be
divided into some definition of "successful" and "failed" but there is no
consensus on how some status results should be bucketed (for example HTTP
4xx results).

### Scope

@@ -82,9 +82,9 @@ successful in an initial release without having standardized a wire protocol.

The operation represented by a span might be composed of sub-operations (for
example a function that calls into other lower level functions) and each of
those sub-operations might have a notion of status or error. If those
sub-operations aren't represented with a child span then it is out of scope
how their status is captured.
those sub-operations might have a notion of status. If those sub-operations
aren't represented with a child span then it is out of scope how their status is
captured.

## Explanation

@@ -95,55 +95,32 @@ translation effort from the native representation and capture at least the key
numeric/string data developers are likely to search for or relate to. I suggest
status is this set of information:

1. StatusType - The name of the type of error such as "HTTP", "gRPC",
"LanguageException", "HRESULT", "POSIX", or "ErrorMessage". The list is
end-user-extensible but common status type names should be standardized.
(Perhaps there is already some standardization we could borrow?)
2. StatusData - A discriminated union of:
- An integer, a string, or a tuple of integer and string. These options can
be used for:
- Enumerated status codes: For example an http status code could be
represented as 404, "Not Found", or both. In the case of common status codes
OpenTelemetry SDK or backend could optionally assist in filling out the
remainder of a partially specified enumeration value. For enumerations that
aren't well-known the community of users is responsible for determining any
conventions.
- Free-form error messages
- Exception object - whatever the SDK language's default exception datatype
is, if it has one.
- void
3. SuccessHint - An optional boolean that represents the span author's best
guess whether this status represents a successful or failed operation, however
they choose to define those terms. For well-known status types I'd suggest the
hint be ignored but for user-defined status types this is likely the only clue
whether the span should be surfaced in a UI as being abnormal or failed.
1. Domain - The name of the domain the status data applies to such as "HTTP",
"gRPC", "HRESULT", "POSIX". The list is end-user-extensible but common status
type names should be standardized.
(Perhaps there is already some standardization we could borrow?)
2. Code - An integer status code. Can be combined with a status message. Either
a code or message are required.
3. Message - A string status message. Can be combined with a status code. Either
a code or message are required.

Review comment (Member), on item 2:
Integer is insufficient. E.g. POSIX codes are symbolic names (like ENOENT) which do not have standardized integer values (e.g. ENOENT might have a different value on Linux than on BSD)

Review comment (Member), on item 3:
Suggested change
a code or message are required.
a code or message is required.

Although UI creators are free to experiment with how the data is presented I
expect most presentations would either be the StatusData alone, or the
StatusData qualified with the StatusType and some separator character. For
example StatusData alone might create names like "FileNotFoundException",
"503", "E_FAIL (0x80004005)", "SyntaxError on line 405: Did you forget a
semicolon?", and "BadQuery".

Exceptions could have progressive level of detail drilling into messages,
stack traces, inner exceptions, links to source, etc if the exporter serialized
sufficient data but how and whether that occurs is out-of-scope in this design.
example StatusData alone might create names like "503", "E_FAIL (0x80004005)",
"Status Code 12: Unimplemented".

Review comment (Member):
This needs to be updated since StatusData was renamed to Code and StatusType to Domain above and Message was introduced as a separate parameter.


#### API to capture this data

I would suggest an API called Span.SetStatus(...) that takes all the arguments
above, and optionally overloads or default parameters that make common calls
easier. For example in C#:

````C#
void SetStatus(Exception spanException, string statusType = "LanguageException", bool? successHint = null);
void SetStatus(string enumNameOrMessage, string statusType = "ErrorMessage", bool? successHint = null);
void SetStatus(int enumValue, string statusType, bool? successHint = null );
void SetStatus(int enumValue, string enumName, string statusType, bool? successHint = null);
void SetStatus(string statusType, bool? successHint);
````


```C#
void SetStatus(string domain, int code, string message);
void SetStatus(string domain, string message);
void SetStatus(string domain, int code);
```
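
As a rough illustration only, the three overloads above could funnel into one small value type that enforces the "either a code or a message" rule. This is a minimal sketch, not part of the proposal: `SpanStatus`, `ExampleSpan`, and the `Status` property are hypothetical names, and rejecting an empty domain is an assumption the text does not state.

```C#
#nullable enable
using System;

// Hypothetical container for the Domain/Code/Message triple described above.
public readonly struct SpanStatus
{
    public string Domain { get; }
    public int? Code { get; }
    public string? Message { get; }

    public SpanStatus(string domain, int? code, string? message)
    {
        // Assumption: a domain is always supplied (every overload takes one).
        if (string.IsNullOrEmpty(domain))
            throw new ArgumentException("A domain is required.", nameof(domain));
        // From the proposal: either a code or a message is required.
        if (code is null && string.IsNullOrEmpty(message))
            throw new ArgumentException("Either a code or a message is required.");

        Domain = domain;
        Code = code;
        Message = message;
    }
}

// Hypothetical stand-in for a Span; only the status-related surface is shown.
public sealed class ExampleSpan
{
    public SpanStatus? Status { get; private set; }

    public void SetStatus(string domain, int code, string message) =>
        Status = new SpanStatus(domain, code, message);

    public void SetStatus(string domain, string message) =>
        Status = new SpanStatus(domain, null, message);

    public void SetStatus(string domain, int code) =>
        Status = new SpanStatus(domain, code, null);
}
```

With this sketch, `span.SetStatus("HTTP", 503, "Service Unavailable")` records all three fields, while `span.SetStatus("HTTP", "")` throws because neither a code nor a message was provided. Note that the integer-only `Code` here inherits the limitation raised in review for domains such as POSIX, where symbolic names like ENOENT are the stable identifier.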

## Internal details

@@ -154,46 +131,20 @@ events. The choice of storage may have some modest effects on memory usage
but primarily I expect the choice would be driven by the SDK API we want for
reading stored data back.

It is also possible for the SDK to destructure the Exception data into simpler
serializable types though I'd expect serialization is typically the domain of
the exporter and there is a fair amount of policy involved in terms of what
data is captured and how it is formatted for transport. There are definite
risks that the end-to-end scenario will be less functional or less performant
if SDKs intercede here.

In some languages storing an exception with traceback could be very
memory intensive. Python prevents locals in the callstack from being
GC'ed. One suggestion is to use a synchronous callback to the exporter
allowing it process the data immediately in some way that would lower
the memory usage. For example the exporter could indicate the
exception could be disposed after optionally extracting a fraction of the
information to serialize.


## Trade-offs and mitigations

As mentioned in the motivation section, the issues of inconsistency, loss of
fidelity and conversion difficulty are all on a sliding scale. I expect this
design improves each of these issues at the expense of some increased
object-model/API complexity. Making the OpenTelemetry status representation
more expressive also could cause inconsistency problems for the opposite reason:
now there might be multiple reasonable representations for a status and the
user becomes unclear which one to pick. In general I expect this to be mitigated
with documented conventions, API defaults, and potentially canonicalizing
data anywhere in the processing pipeline.

One place I neglected to go further was defining additional types of error data
format beyond string/int/Exception. This might encapsulate things such as
key/value pairs (à la structured logging) or more complex or niche status types
(COM IErrorInfo). There is nothing inherently problematic with them but I felt
the increased expressiveness was getting diminishing returns. One mitigation to
key/value pair data in particular would be to put auxiliary data in the span
attributes using some convention, for example "Error.UserName"="Bob" might be
used together with an error message string "Failed to find user {UserName}".
Another mitigation might be adding language specific overloads that handle
additional error types.

The current proposal only gives one string which can be used either for a
freeform message or a textual status code. Adding a 2nd string to the
StatusData would allow both to be collected side by side. This is another example of
increasing expressivity at the cost of some complexity. I'd be happy to see

Review comment (Member, @arminru, Aug 19, 2020), on the "StatusData" line above:
Same as above (#133 (comment)).
@@ -210,16 +161,9 @@ valuable with potentially a slight

There have been a few past attempts to make improvements here:

- [open-telemetry/oteps#69](https://github.com/open-telemetry/oteps/pull/69)
- [#427](https://github.com/open-telemetry/opentelemetry-specification/pull/427)
- [#432](https://github.com/open-telemetry/opentelemetry-specification/pull/432)
- [#521](https://github.com/open-telemetry/opentelemetry-specification/pull/521)
- [#599](https://github.com/open-telemetry/opentelemetry-specification/issues/599)
- [open-telemetry/oteps#123](https://github.com/open-telemetry/oteps/pull/123)

Review comment (Member):
Suggested change
- [open-telemetry/oteps#123](https://github.com/open-telemetry/oteps/pull/123)
- [open-telemetry/oteps#123](https://github.com/open-telemetry/oteps/pull/123) on which this OTEP is based
- [open-telemetry/opentelemetry-specification#697](https://github.com/open-telemetry/opentelemetry-specification/pull/697) for reporting exceptions

- https://gitter.im/open-telemetry/error-events-wg

There are also some links to further prior art within those links, sorry
I didn't organize it all nicely here : )

#### Alternatives

**Status information using logging** - It is possible to collect status and
@@ -229,20 +173,20 @@ integration between distributed tracing and logging I don't believe this is
a problem that distributed tracing should abdicate for various reasons:

- Logging isn't a simple or cheap dependency to take in the implementation.
While OpenTelemetry may solidify a logging offering for API and SDK components,
an end-to-end scenario still requires snooping the log stream to determine
relevant messages for a given span. Regardless of whether this is done
client-side, at the database or in the UI layer it potentially involves
performance overhead of handling an order of magnitude more data.
- It requires establishing conventions that designate which log message
represents 'the status' for a Span rather than one of potentially many results
or errors that were recorded during the Span's duration. This likely means this
status is a special case on the logging API if it isn't a special case on the
distributed trace API.
- I expect developers both at the time they are emitting trace data and when
viewing that trace data would find it idiomatic to include the status of
the Span's workload together with the description of the Span. Identifying
failed or abnormal spans is a typical APM operation.

**Closed vs. open-ended status descriptions** - We could make status
represented by a fixed set of options such as the current gRPC codes
@@ -263,19 +207,13 @@ API is preferable I don't have that strong of an opinion. Semantic conventions
are likely to have higher performance overhead, higher risk of error in key
names and are less discoverable/refactorable in IDEs. The advantages are that
there is some past precedent for doing this specific to http codes and new
conventions can be added easily. If we go semantic conventions it does imply
that Exception becomes a type that can be directly passed as an argument to
SetAttribute(). Requiring the user to destructure the exception into a list of
key value pairs would be overly onerous and error-prone for a common usage
scenario. If desired the SDK or exporter could destructure it, but that can be
determined independently from API design and I'd like to keep it out of scope.
conventions can be added easily.

**API using Span event semantic conventions** - Most of the rationale for attribute semantic
conventions also applies here, events are effectively another key-value store
mechanism. The timestamp that is attached to an event appears to hold little
value as status is probably produced at the same approximate time the span end
timestamp is recorded. Similar to attribute conventions it sounds like there is
precedent for storing some errors as events.
timestamp is recorded.

**Move the API to a non-core package** - It is possible to have the Tracing
API expose status using Attribute or Event APIs, and then have a 2nd library
@@ -285,19 +223,16 @@ declared directly on Span but if we identify this as an area that needs to be
more decoupled/versionable than other Span tracing APIs perhaps it would be
valuable.

**Represent an error message in addition to a string error name**

## Open questions

1. Although I specified that common status types could be given standardized
names, I didn't define what that list was. We would need to define what
criteria makes a status type common enough to be on the list and maintain it
over time.
2. Above the design mentioned that the SDK might fill in the name or integer
value of a well-known status code when only one of the two was specified by a
user. We'd have to decide if that is functionality we want, and which values
are included in mapping tables.
3. I left the SDK API out-of-scope, but we will need a way to retrieve the
stored data via the SDK before adding data to a span has any value in an
end-to-end scenario.
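
To make open question 2 more concrete, here is a minimal sketch of what "filling in" a partially specified well-known status could look like. Everything in it is an assumption for illustration: the `WellKnownStatus` class, the `Complete` method, and the three-entry HTTP table are hypothetical, and the proposal does not say whether an SDK, exporter, or backend would own such a table.

```C#
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical helper; the proposal leaves open whether this exists at all.
public static class WellKnownStatus
{
    // Illustrative subset only; which values belong in a mapping table is
    // exactly the open question.
    private static readonly Dictionary<int, string> HttpNames =
        new Dictionary<int, string>
        {
            [404] = "Not Found",
            [500] = "Internal Server Error",
            [503] = "Service Unavailable",
        };

    // Returns the status with the missing half filled in when it is known;
    // otherwise returns the inputs unchanged.
    public static (int? Code, string? Message) Complete(
        string domain, int? code, string? message)
    {
        if (domain != "HTTP")
            return (code, message); // unknown domain: leave as-is

        // Code given, name missing: look the name up.
        if (code is int c && message == null &&
            HttpNames.TryGetValue(c, out var name))
            return (c, name);

        // Name given, code missing: reverse lookup on an exact name match.
        if (code == null && message != null)
        {
            foreach (var pair in HttpNames)
            {
                if (string.Equals(pair.Value, message,
                        StringComparison.OrdinalIgnoreCase))
                    return (pair.Key, message);
            }
        }

        return (code, message);
    }
}
```

One design consequence this sketch surfaces: the reverse lookup only works when the message is a canonical name, so a free-form message in the same field would silently stay without a code.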