Proposed updates to #184 #1
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,13 +1,13 @@ | ||
<!--- Hugo front matter used to generate the website version of this page: | ||
linkTitle: OTel Export | ||
linkTitle: OpenTelemetry Export | ||
---> | ||
|
||
# Semantic Conventions for OTel Export Metrics | ||
# Semantic Conventions for OpenTelemetry Export Metrics | ||
|
||
**Status**: [Experimental][DocumentStatus] | ||
|
||
This document describes instruments and attributes for OTel | ||
Export level metrics. Consider the [general metric semantic | ||
This document describes instruments and attributes for OpenTelemetry | ||
export-level metrics. Consider the [general metric semantic | ||
conventions](README.md#general-metric-semantic-conventions) when creating | ||
instruments not explicitly defined in the specification. | ||
|
||
|
@@ -16,49 +16,152 @@ instruments not explicitly defined in the specification. | |
<!-- toc --> | ||
|
||
- [Metric Instruments](#metric-instruments) | ||
* [Metric: `otel.processor.spans`](#metric-otelprocessorspans) | ||
* [Metric: `otel.exporter.spans`](#metric-otelexporterspans) | ||
* [Metric: `otel.processor.items`](#metric-otelprocessoritems) | ||
* [Metric: `otel.exporter.items`](#metric-otelexporteritems) | ||
|
||
<!-- tocstop --> | ||
|
||
## Principles used | ||
|
||
This specification defines three levels of detail, `basic`, `normal`, | ||
and `detailed`, that a component can be configured with. The level | ||
determines which attributes are kept or removed, as by a | ||
[Metric view](https://opentelemetry.io/docs/specs/otel/metrics/sdk/#view). We | ||
rely on a conservation principle for pipelines, which states generally | ||
that what goes in, comes out. | ||
|
||
### Signal-independent metric names | ||
|
||
OpenTelemetry currently has 3 signal types, but it may add more. | ||
Instead of using the signal name in the metric names, we opt for a | ||
general-purpose noun that usefully describes any signal. The | ||
signal-agnostic term used here is `items`, referring to spans, log | ||
records, and metric data points. A `signal` attribute distinguishes | ||
the signal type, using the names designated by the project: `traces`, | ||
`logs`, and `metrics`. | ||
|
||
Users are expected to understand that the data item for traces is a | ||
span, for logs is a record, and for metrics is a point. | ||
|
||
### Distinguishing Collectors from SDKs | ||
|
||
SDKs and Collectors process the same data in a pipeline, and both | ||
OpenTelemetry Collector and SDKs are recommended to use the metric | ||
names specified here. An attribute to distinguish the `domain` is | ||
used, with values like `sdk`, `collector`. | ||
|
||
In a multi-level collection pipeline, each layer is expected to use a | ||
unique domain. This enables calculating aggregates at each level in | ||
the collection pipeline and comparing them, as a measure of aggregate | ||
leakage. Multi-level collector topologies should allow configuration | ||
of distinct domains (e.g., `agent` and `gateway`). | ||
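
To illustrate the per-domain comparison described above, here is a minimal sketch that computes leakage between adjacent layers from per-domain totals. The function name, the ordering list, and the sample numbers are hypothetical and are not part of this specification; the sketch assumes each total is the sum of `otel.exporter.items` for one `domain` value.

```python
from typing import Dict, List, Tuple


def leakage_between_domains(
    totals: Dict[str, int], order: List[str]
) -> List[Tuple[str, str, int]]:
    """Return (upstream, downstream, leaked) for each adjacent pair of domains.

    `order` lists the domains from the first layer of the pipeline to the last.
    """
    result = []
    for upstream, downstream in zip(order, order[1:]):
        # By conservation, whatever left the upstream layer but was not
        # counted by the downstream layer has leaked in between.
        result.append((upstream, downstream, totals[upstream] - totals[downstream]))
    return result


# Hypothetical totals for a three-layer pipeline.
totals = {"sdk": 1000, "agent": 990, "gateway": 985}
leaks = leakage_between_domains(totals, ["sdk", "agent", "gateway"])
print(leaks)
```

A unique domain per layer is what makes this comparison possible: if two layers shared a domain, their totals could not be separated.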
|
||
### Basic level of detail | ||
|
||
At the basic level of detail, we only need to know what goes into a | ||
component, because we are able to infer a lot about the component by | ||
comparing its metrics with the next component in the pipeline. By | ||
conservation, for example, any items that are received by an SDK | ||
processor component and are not received by the SDK exporter component | ||
must have been dropped. | ||
|
||
Therefore at the basic level of detail, all items of data are counted | ||
when they are done, regardless of the outcome. No additional | ||
attributes are used at this level of detail. | ||
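
As a sketch of the inference available at the basic level, the snippet below derives the number of dropped items from two totals alone. The function name and the sample values are hypothetical; the totals are assumed to be the sums of `otel.processor.items` and `otel.exporter.items` over the same time window.

```python
def dropped_items(processor_total: int, exporter_total: int) -> int:
    """Items counted by the processor that were never counted by the exporter."""
    if exporter_total > processor_total:
        # The conservation principle does not hold, so the two counters
        # are not comparable (e.g., different windows or pipelines).
        raise ValueError("exporter total exceeds processor total")
    return processor_total - exporter_total


# Hypothetical totals: 1000 items entered the processor, 970 reached the exporter.
print(dropped_items(1000, 970))
```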
|
||
### Normal level of detail | ||
|
||
At the normal level of detail, an attribute is introduced that | ||
distinguishes whether the item was or was not successful. A boolean | ||
attribute `success` is introduced at this level. | ||
|
||
It is understood that components have limited information about the | ||
success or failure of subsequent pipeline components. While the | ||
veracity of `success=true` may be subject to reasonable doubt, the | ||
`success=false` attribute should be accepted as fact. In an SDK | ||
configuration, the processor's `success=false` may be compared with | ||
the exporter's `success=false` to determine the number of items | ||
dropped by the processor, for example. | ||
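
The comparison described above can be sketched as follows. This is one reading of the text, assuming the processor's `success=false` count includes both items it dropped itself and items whose export failed; the function name and the sample counts are hypothetical.

```python
from typing import Dict


def dropped_by_processor(processor: Dict[bool, int], exporter: Dict[bool, int]) -> int:
    """Subtract exporter failures from processor failures.

    Each argument maps the value of the `success` attribute to an item count.
    """
    return processor.get(False, 0) - exporter.get(False, 0)


# Hypothetical counts: 1000 items enter the processor, 30 are dropped there,
# 970 reach the exporter, and 20 of those fail to export.
processor_counts = {True: 950, False: 50}
exporter_counts = {True: 950, False: 20}
print(dropped_by_processor(processor_counts, exporter_counts))
```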
|
||
### Detailed metrics | ||
Kinda similar comment to basic - why have a separate level? What's the objective worth complicating the spec this way?

This is about letting users trade costs based on what they need/want to observe: more metrics may be useful, but they're just additional expense when they're not being used. I mentioned a personal side-story that led me to this realization in today's Spec SIG: to monitor a water system is similar to monitoring a telemetry pipeline, and it's also a situation where each individual metric is a substantial expense. The minimum number of meters necessary to calculate total leakage in the system is 1 meter for (total) system production and 1 meter per user with a service connection. From total in and total out we can compute leakage, which is equivalent to the calculation for dropped items. This leads to a conclusion that the minimum-cost configuration for a telemetry pipeline, capable of computing a global Dropped statistic, would use Basic-level detail in each SDK, (disabled metrics for all intermediate collectors), and Normal-level detail for the final component of the final collector in the pipeline. If the user is in a situation where the metrics from the SDKs are not comparable with the metrics from subsequent stages in the pipeline for any reason, they should use Normal-level detail in the SDK. I'm also aware of tracing pipelines where there are rate limits enforced at the destination. This is a scenario where if the response code is

But why shouldn't this be handled with just the pre-aggregation rules ("views"?) instead of making it a problem for the exporters / components to know about different levels?

(This content will appear in a new location, I'm writing an OTEP.)

Still rough: open-telemetry/oteps#238
||
|
||
At the detailed level of metrics, the component includes an additional | ||
`reason` attribute to explain its outcomes. Its values should be interpreted | ||
relative to the value of the `success` attribute, which is always | ||
present when detailed metrics are in use. For the `success=true` | ||
case, components are recommended to use `reason=ok`. | ||
|
||
Components should use short, descriptive names to explain failure | ||
outcomes. For example, an SDK span processor could use | ||
`reason=queue_full` to annotate dropped spans and | ||
`reason=export_failed` to indicate when the exporter failed. | ||
|
||
Exporter components are encouraged to use system-specific details as | ||
the reason. For example, a gRPC-based exporter would naturally use the | ||
string form of the gRPC status code as the reason (e.g., | ||
`deadline_exceeded`, `resource_exhausted`, `unimplemented`). | ||
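
To show how the three levels of detail affect the attribute set, here is a minimal sketch of a helper that builds attributes for one counted item. The helper name, its signature, and the rule that a failure must carry an explicit reason are assumptions made for illustration; only the attribute names and the `ok` value come from this document.

```python
from typing import Dict, Optional, Union

AttributeValue = Union[str, bool]


def item_attributes(
    level: str,
    component: str,  # "processor" or "exporter"; used as the attribute prefix
    domain: str,
    name: str,
    signal: str,
    success: bool,
    reason: Optional[str] = None,
) -> Dict[str, AttributeValue]:
    """Build the attribute set for one item at the given level of detail."""
    if level not in ("basic", "normal", "detailed"):
        raise ValueError(f"unknown level of detail: {level!r}")
    # Present at every level.
    attrs: Dict[str, AttributeValue] = {
        f"{component}.domain": domain,
        f"{component}.name": name,
        f"{component}.signal": signal,
    }
    if level in ("normal", "detailed"):
        attrs[f"{component}.success"] = success
    if level == "detailed":
        if reason is None:
            if not success:
                # Assumption of this sketch: failures need an explicit reason.
                raise ValueError("a failure needs a reason at the detailed level")
            reason = "ok"
        attrs[f"{component}.reason"] = reason
    return attrs


# A gRPC-based exporter reporting a failure with the status code as the reason.
print(item_attributes("detailed", "exporter", "sdk", "otlp/grpc", "traces",
                      False, "resource_exhausted"))
```

Moving from `basic` to `normal` adds one attribute and from `normal` to `detailed` adds another, so each step up increases the number of timeseries a component can produce.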
|
||
### Component types and optional names | ||
|
||
Components are uniquely identified using a descriptive `name` | ||
attribute which encompasses at least a short name describing the type | ||
of component being used (e.g., `batch` for the SDK BatchSpanProcessor | ||
or the Collector batch processor). | ||
|
||
When there is more than one component of a given type active in a | ||
pipeline having the same `domain` and `signal` attributes, the `name` | ||
should include additional information to disambiguate the multiple | ||
instances using the syntax `<type>/<instance>`. For example, if there | ||
were two `batch` processors in a collection pipeline (e.g., one for | ||
error spans and one for non-error spans) they might use the names | ||
`batch/error` and `batch/noerror`. | ||
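
The `<type>/<instance>` syntax can be handled with a small parser. This is a sketch only: the function name is hypothetical, and rejecting an empty type or an empty instance is an assumption, since the text does not say how malformed names should be treated.

```python
from typing import Optional, Tuple


def parse_component_name(name: str) -> Tuple[str, Optional[str]]:
    """Split a component name into (type, instance); instance is None if absent."""
    component_type, separator, instance = name.partition("/")
    if not component_type or (separator and not instance):
        raise ValueError(f"malformed component name: {name!r}")
    return component_type, (instance if separator else None)


print(parse_component_name("batch"))
print(parse_component_name("batch/error"))
```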
|
||
## Metric Instruments | ||
|
||
### Metric: `otel.processor.spans` | ||
### Metric: `otel.processor.items` | ||
|
||
This metric is [required][MetricRequired]. | ||
|
||
<!-- semconv metric.otel.processor.spans(metric_table) --> | ||
<!-- semconv metric.otel.processor.items(metric_table) --> | ||
| Name | Instrument Type | Unit (UCUM) | Description | | ||
| -------- | --------------- | ----------- | -------------- | | ||
| `otel.processor.spans` | Counter | `{span}` | Measures the number of processed Spans. | | ||
| `otel.processor.items` | Counter | `{items}` | Measures the number of processed items (signal specific). | | ||
<!-- endsemconv --> | ||
|
||
<!-- semconv metric.otel.processor.spans(full) --> | ||
| Attribute | Type | Description | Examples | Requirement Level | | ||
|---|---|---|---|---| | ||
| `processor.dropped` | boolean | Whether the Span was dropped or not. [1] | | Required | | ||
| `processor.type` | string | Type of processor being used. | `BatchSpanProcessor` | Recommended | | ||
|
||
**[1]:** Spans may be dropped if the internal buffer is full. | ||
<!-- semconv metric.otel.processor.items(full) --> | ||
| Attribute | Type | Description | Examples | Requirement Level | | ||
|---------------------|---------|----------------------------------------------------------|----------------------------------------------------|-------------------| | ||
| `processor.domain` | string | Domain of the pipeline with this processor | `sdk`, `collector` | Required | | ||
| `processor.name` | string | Type and optional name of this processor. | `batch`, `batch/errors` | Required | | ||
| `processor.signal` | string | Type of signal being described. | `traces`, `logs`, `metrics` | Required | | ||
| `processor.success` | boolean | Whether the item was successful or not. [1] | true, false | Recommended | | ||
| `processor.reason` | string | Short string explaining category of success and failure. | `ok`, `queue_full`, `timeout`, `permission_denied` | Detailed | | ||
|
||
**[1]:** Consider `success=false` a stronger signal than `success=true`. | ||
<!-- endsemconv --> | ||
|
||
### Metric: `otel.exporter.spans` | ||
### Metric: `otel.exporter.items` | ||
|
||
This metric is [required][MetricRequired]. | ||
|
||
<!-- semconv metric.otel.exporter.spans(metric_table) --> | ||
<!-- semconv metric.otel.exporter.items(metric_table) --> | ||
| Name | Instrument Type | Unit (UCUM) | Description | | ||
| -------- | --------------- | ----------- | -------------- | | ||
| `otel.exporter.spans` | Counter | `{span}` | Measures the number of exported Spans. | | ||
| `otel.exporter.items` | Counter | `{items}` | Measures the number of exported items (signal specific). | | ||
<!-- endsemconv --> | ||
|
||
<!-- semconv metric.otel.exporter.spans(full) --> | ||
| Attribute | Type | Description | Examples | Requirement Level | | ||
|---|---|---|---|---| | ||
| `exporter.dropped` | boolean | Whether the Span was dropped or not. [1] | | Required | | ||
| `exporter.type` | string | Type of exporter being used. | `OtlpGrpcSpanExporter` | Recommended | | ||
<!-- semconv metric.otel.exporter.items(full) --> | ||
| Attribute | Type | Description | Examples | Requirement Level | | ||
|--------------------|---------|----------------------------------------------------------|----------------------------------------------------|-------------------| | ||
| `exporter.domain` | string | Domain of the pipeline with this exporter | `sdk`, `collector` | Required | | ||
| `exporter.name` | string | Type and optional name of this exporter. | `otlp/http`, `otlp/grpc` | Required | | ||
| `exporter.signal` | string | Type of signal being described. | `traces`, `logs`, `metrics` | Required | | ||
| `exporter.success` | boolean | Whether the item was successful or not. [1] | true, false | Recommended | | ||
| `exporter.reason` | string | Short string explaining category of success and failure. | `ok`, `queue_full`, `timeout`, `permission_denied` | Detailed | | ||
|
||
**[1]:** Spans may be dropped in case of failed ingestion, e.g. network problem or the exported endpoint being down. | ||
**[1]:** Items may be dropped in case of failed ingestion, e.g. a network problem or the export endpoint being down. Consult transport-specific instrumentation for more information about the export requests themselves, including retry attempts. | ||
<!-- endsemconv --> | ||
|
||
[MetricRequired]: https://github.com/open-telemetry/opentelemetry-specification/blob/v1.22.0/specification/metrics/metric-requirement-level.md#required |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,38 +1 @@ | ||
groups: | ||
- id: metric.otel.exporter.spans | ||
type: metric | ||
metric_name: otel.exporter.spans | ||
brief: "Measures the number of exported Spans." | ||
instrument: counter | ||
unit: "{span}" | ||
attributes: | ||
- id: exporter.dropped | ||
type: boolean | ||
requirement_level: required | ||
brief: "Whether the Span was dropped or not." | ||
note: > | ||
Spans may be dropped in case of failed ingestion, e.g. network problem | ||
or the exported endpoint being down. | ||
- id: exporter.type | ||
type: string | ||
requirement_level: recommended | ||
brief: "Type of exporter being used." | ||
examples: ["OtlpGrpcSpanExporter"] | ||
- id: metric.otel.processor.spans | ||
type: metric | ||
metric_name: otel.processor.spans | ||
brief: "Measures the number of processed Spans." | ||
instrument: counter | ||
unit: "{span}" | ||
attributes: | ||
- id: processor.dropped | ||
type: boolean | ||
requirement_level: required | ||
brief: "Whether the Span was dropped or not." | ||
note: > | ||
Spans may be dropped if the internal buffer is full. | ||
- id: processor.type | ||
type: string | ||
requirement_level: recommended | ||
brief: "Type of processor being used." | ||
examples: ["BatchSpanProcessor"] | ||
DO NOT REVIEW |
What is the value of having this level? It saves a single binary attribute, but there are plenty of other attributes that are required (domain, name, signal, etc), so the complexity of the spec doesn't seem to be warranted.
Saving a boolean attribute means having half as many (i.e., one less) timeseries. The information available in the attribute is almost redundant, so I think having a way to avoid the additional 1 timeseries matters.
When you have metrics on a pipeline, the information available by having a `success` attribute (i.e., one additional series) can be inferred by comparing the subsequent component's totals. This is admittedly a recursive definition -- for the subsequent component to establish its success/failure rate it will need its own subsequent component's totals, and the final stage in a pipeline will likely not want to use basic-level metrics for this reason. If `Total(X)` is the sum of the single metric for a component X, the recursive rule for deriving Success/Failure of that component is: