Proposed updates to #184 #1

Closed
10 changes: 8 additions & 2 deletions docs/otel/README.md
@@ -2,10 +2,16 @@

**Status**: [Experimental][DocumentStatus]

This document defines semantic conventions for OTel components (such as processors, exporters, etc).
This document defines semantic conventions for OpenTelemetry
data-reporting components such as processors and exporters. These
components are generally specified in an OpenTelemetry SDK specification,
for example [Span
Processor](https://opentelemetry.io/docs/specs/otel/trace/sdk/#span-processor)
and [Span
Exporter](https://opentelemetry.io/docs/specs/otel/trace/sdk/#span-exporter).

OTel Component semantic conventions are defined for the following metrics:

* [Export](export-metrics.md): For export level metrics.
* [Export](export-metrics.md): For export-level metrics.

[DocumentStatus]: https://github.com/open-telemetry/opentelemetry-specification/blob/v1.22.0/specification/document-status.md
151 changes: 127 additions & 24 deletions docs/otel/export-metrics.md
@@ -1,13 +1,13 @@
<!--- Hugo front matter used to generate the website version of this page:
linkTitle: OTel Export
linkTitle: OpenTelemetry Export
--->

# Semantic Conventions for OTel Export Metrics
# Semantic Conventions for OpenTelemetry Export Metrics

**Status**: [Experimental][DocumentStatus]

This document describes instruments and attributes for OTel
Export level metrics. Consider the [general metric semantic
This document describes instruments and attributes for OpenTelemetry
export-level metrics. Consider the [general metric semantic
conventions](README.md#general-metric-semantic-conventions) when creating
instruments not explicitly defined in the specification.

@@ -16,49 +16,152 @@ instruments not explicitly defined in the specification.
<!-- toc -->

- [Metric Instruments](#metric-instruments)
* [Metric: `otel.processor.spans`](#metric-otelprocessorspans)
* [Metric: `otel.exporter.spans`](#metric-otelexporterspans)
* [Metric: `otel.processor.items`](#metric-otelprocessoritems)
* [Metric: `otel.exporter.items`](#metric-otelexporteritems)

<!-- tocstop -->

## Principles used

This specification defines three levels of detail, `basic`, `normal`,
and `detailed`. The level a component is used with determines which
attributes are kept or removed, as by a
[Metric
view](https://opentelemetry.io/docs/specs/otel/metrics/sdk/#view). We
rely on a conservation principle for pipelines, which states that, in
general, what goes in comes out.

### Signal-independent metric names

OpenTelemetry currently has three signal types, but it may add more.
Instead of using the signal name in the metric names, we opt for a
general-purpose noun that usefully describes any signal. The
signal-agnostic term used here is `items`, referring to spans, log
records, and metric data points. A `signal` attribute distinguishes
the signal type, using the names designated by the project: `traces`,
`logs`, and `metrics`.

Users are expected to understand that the data item for traces is a
span, for logs is a record, and for metrics is a point.

### Distinguishing Collectors from SDKs

SDKs and Collectors process the same data in a pipeline, and both
OpenTelemetry Collectors and SDKs are recommended to use the metric
names specified here. A `domain` attribute distinguishes them, with
values such as `sdk` and `collector`.

In a multi-level collection pipeline, each layer is expected to use a
unique domain. This enables calculating aggregates at each level in
the collection pipeline and comparing them, as a measure of aggregate
leakage. Multi-level collector topologies should allow configuration
of distinct domains (e.g., `agent` and `gateway`).

### Basic level of detail

Review comment:

What is the value of having this level? It saves a single binary attribute, but there are plenty of other attributes that are required (domain, name, signal, etc), so the complexity of the spec doesn't seem to be warranted.

Author reply:

Saving a boolean attribute means having half as many (i.e., one less) timeseries. The information available in the attribute is almost redundant, so I think having a way to avoid the additional 1 timeseries matters.

When you have metrics on a pipeline, the information available by having a success attribute (i.e., one additional series) can be inferred by comparing the subsequent component's totals. This is admittedly a recursive definition -- for the subsequent component to establish its success/failure rate it will need its own subsequent component's totals, and the final stage in a pipeline will likely not want to use basic-level metrics for this reason. If Total(X) is the sum of the single metric for a component X, the recursive rule for deriving Success/Failure of that component is:

Dropped(this) = Total(this) - Total(next)
Failed(this) = Dropped(this) + Failed(next)
Success(this) = Total(this) - Failed(this)
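
The recursive rule can be sketched in Python as follows. This is only an illustration, not part of the proposal: the function name `derive_outcomes`, the list-of-totals input, and the assumption that the final stage reports its own failure count (for example through Normal-level detail) are hypothetical. The sketch takes Dropped(this) to be Total(this) minus Total(next), that is, the items that entered this component but never reached the next one.

```python
from typing import List, Tuple


def derive_outcomes(totals: List[int], final_failed: int) -> List[Tuple[int, int, int]]:
    """Return (dropped, failed, success) for each stage, first stage first.

    totals[i] is the basic-level item count of stage i. final_failed is the
    failure count reported by the last stage itself (a hypothetical input).
    """
    results: List[Tuple[int, int, int]] = []
    failed_next = final_failed
    last = len(totals) - 1
    # Walk the pipeline backward, because each stage depends on the next one.
    for i in range(last, -1, -1):
        if i == last:
            # No next component to compare against: nothing can be derived
            # as dropped, and failures come from the stage's own report.
            dropped = 0
            failed = final_failed
        else:
            dropped = totals[i] - totals[i + 1]
            failed = dropped + failed_next
        results.append((dropped, failed, totals[i] - failed))
        failed_next = failed
    results.reverse()
    return results
```

For example, with totals of 100, 90, and 85 items and 5 failures reported by the last stage, every stage derives the same 80 successes, which is what the conservation principle predicts.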


At the basic level of detail, we only need to know what goes into a
component, because we are able to infer a lot about the component by
comparing its metrics with those of the next component in the pipeline.
By conservation, for example, any items that are received by an SDK
processor component and are not received by the SDK exporter component
must have been dropped.

Therefore, at the basic level of detail, all items of data are counted
when the component is done with them, regardless of the outcome. No
additional attributes are used at this level of detail.

### Normal level of detail

At the normal level of detail, a boolean attribute `success` is
introduced that distinguishes whether the item was or was not
successful.

It is understood that components have limited information about the
success or failure of subsequent pipeline components. While the
veracity of `success=true` may be subject to reasonable doubt, the
`success=false` attribute should be accepted as fact. In an SDK
configuration, the processor's `success=false` may be compared with
the exporter's `success=false` to determine the number of items
dropped by the processor, for example.
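
A minimal sketch of that comparison, in Python. The data-point shape (an attribute dictionary paired with a count), the bare `success` key, and both function names are assumptions made for illustration; they are not defined by this document.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical data-point shape: (attributes, cumulative item count).
Point = Tuple[Dict[str, object], int]


def failed_items(points: Iterable[Point]) -> int:
    """Sum the counts of all points that carry success=False."""
    return sum(value for attrs, value in points if attrs.get("success") is False)


def dropped_by_processor(processor_points: Iterable[Point],
                         exporter_points: Iterable[Point]) -> int:
    """Processor failures that the exporter's own failures do not explain."""
    return failed_items(processor_points) - failed_items(exporter_points)
```

If the processor reports 20 failed items and the exporter reports 5, the remaining 15 never reached the exporter and are attributed to the processor.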

### Detailed metrics

Review comment:

kinda similar comment to basic - why have a separate level? What's the objective worth complicating the spec this way?

Author reply (@jmacd, Oct 24, 2023):

This is about letting users trade costs based on what they need/want to observe: more metrics may be useful, but they're just additional expense when they're not being used.

I mentioned a personal side-story that led me to this realization in today's Spec SIG: to monitor a water system is similar to monitoring a telemetry pipeline, and it's also a situation where each individual metric is a substantial expense. The minimum number of meters necessary to calculate total leakage in the system is 1 meter for (total) system production and 1 meter per user with a service connection. From total in and total out we can compute leakage, which is equivalent to the calculation for dropped items.

This leads to a conclusion that the minimum-cost configuration for a telemetry pipeline, capable of computing a global Dropped statistic, would use Basic-level detail in each SDK, disabled metrics for all intermediate collectors, and Normal-level detail for the final component of the final collector in the pipeline. If the user is in a situation where the metrics from the SDKs are not comparable with the metrics from subsequent stages in the pipeline for any reason, they should use Normal-level detail in the SDK.

I'm also aware of tracing pipelines where there are rate limits enforced at the destination. This is a scenario where if the response code is resource_exhausted I should turn up sampling, if it's timeout I should complain to my backend team about an SLO violation, and if it's queue_full it means I should reconfigure the SDK.

Review comment:

But why shouldn't this be handled with just the pre-aggregation rules ("views"?) instead of making it a problem for the exporters / components to know about different levels?

Author reply:

(This content will appear in a new location, I'm writing an OTEP.)
My assumption is that this would be implemented using views, and the text of a semantic convention would be explaining which views to configure at which level of detail.


At the detailed level of metrics, the component includes an additional
`reason` attribute to explain its outcomes. These should be interpreted
relative to the value of the `success` attribute, which is always
present when detailed metrics are in use. For the `success=true`
case, components are recommended to use `reason=ok`.

Components should use short, descriptive names to explain failure
outcomes. For example, an SDK span processor could use
`reason=queue_full` to annotate dropped spans and
`reason=export_failed` to indicate when the exporter failed.

Exporter components are encouraged to use system-specific details as
the reason. For example, a gRPC-based exporter would naturally use the
string form of the gRPC status code as the reason (e.g.,
`deadline_exceeded`, `resource_exhausted`, `unimplemented`).
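
The three levels of detail can be summarized with a short Python sketch that builds the attribute set for one counted item. The function name `attributes_for`, the unprefixed attribute keys, and the choice to raise an error when a failure has no reason are assumptions for illustration only. In practice the same effect would more likely come from configuring a Metric view than from component code.

```python
from typing import Dict, Optional

LEVELS = ("basic", "normal", "detailed")


def attributes_for(level: str, domain: str, name: str, signal: str,
                   success: bool, reason: Optional[str] = None) -> Dict[str, object]:
    """Return the attributes kept at the given level of detail."""
    if level not in LEVELS:
        raise ValueError(f"unknown level of detail: {level!r}")
    # Identifying attributes are present at every level.
    attrs: Dict[str, object] = {"domain": domain, "name": name, "signal": signal}
    if level in ("normal", "detailed"):
        attrs["success"] = success
    if level == "detailed":
        if reason is None:
            if not success:
                # Hypothetical policy: a failure must say why it failed.
                raise ValueError("a failed item needs a reason at the detailed level")
            reason = "ok"  # recommended value for the success=true case
        attrs["reason"] = reason
    return attrs
```

Each level adds exactly one attribute to the previous one, so each step up multiplies the number of timeseries only by the cardinality of the attribute it adds.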

### Component types and optional names

Components are uniquely identified using a descriptive `name`
attribute which encompasses at least a short name describing the type
of component being used (e.g., `batch` for the SDK BatchSpanProcessor
or the Collector batch processor).

When there is more than one component of a given type active in a
pipeline having the same `domain` and `signal` attributes, the `name`
should include additional information to disambiguate the multiple
instances using the syntax `<type>/<instance>`. For example, if there
were two `batch` processors in a collection pipeline (e.g., one for
error spans and one for non-error spans) they might use the names
`batch/error` and `batch/noerror`.
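
A small Python sketch of the `<type>/<instance>` syntax. Both helper names are hypothetical, and the validation rules (a non-empty type, and no trailing slash) are assumptions rather than requirements stated here.

```python
from typing import Optional, Tuple


def make_component_name(component_type: str, instance: Optional[str] = None) -> str:
    """Join a component type and an optional instance into a name."""
    return f"{component_type}/{instance}" if instance else component_type


def split_component_name(name: str) -> Tuple[str, Optional[str]]:
    """Split a name into (type, instance); instance is None when absent."""
    component_type, sep, instance = name.partition("/")
    if not component_type or (sep and not instance):
        raise ValueError(f"malformed component name: {name!r}")
    return component_type, (instance if sep else None)
```

Splitting on the first slash only keeps the type stable even if an instance name happens to contain a slash itself.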

## Metric Instruments

### Metric: `otel.processor.spans`
### Metric: `otel.processor.items`

This metric is [required][MetricRequired].

<!-- semconv metric.otel.processor.spans(metric_table) -->
<!-- semconv metric.otel.processor.items(metric_table) -->
| Name | Instrument Type | Unit (UCUM) | Description |
| -------- | --------------- | ----------- | -------------- |
| `otel.processor.spans` | Counter | `{span}` | Measures the number of processed Spans. |
| `otel.processor.items` | Counter | `{items}` | Measures the number of processed items (signal specific). |
<!-- endsemconv -->

<!-- semconv metric.otel.processor.spans(full) -->
| Attribute | Type | Description | Examples | Requirement Level |
|---|---|---|---|---|
| `processor.dropped` | boolean | Whether the Span was dropped or not. [1] | | Required |
| `processor.type` | string | Type of processor being used. | `BatchSpanProcessor` | Recommended |

**[1]:** Spans may be dropped if the internal buffer is full.
<!-- semconv metric.otel.processor.items(full) -->
| Attribute | Type | Description | Examples | Requirement Level |
|---------------------|---------|----------------------------------------------------------|----------------------------------------------------|-------------------|
| `processor.domain` | string | Domain of the pipeline with this processor. | `sdk`, `collector` | Required |
| `processor.name` | string | Type and optional name of this processor. | `batch`, `batch/errors` | Required |
| `processor.signal` | string | Type of signal being described. | `traces`, `logs`, `metrics` | Required |
| `processor.success` | boolean | Whether the item was successful or not. [1] | true, false | Recommended |
| `processor.reason` | string | Short string explaining category of success and failure. | `ok`, `queue_full`, `timeout`, `permission_denied` | Detailed |

**[1]:** Consider `success=false` a stronger signal than `success=true`.
<!-- endsemconv -->

### Metric: `otel.exporter.spans`
### Metric: `otel.exporter.items`

This metric is [required][MetricRequired].

<!-- semconv metric.otel.exporter.spans(metric_table) -->
<!-- semconv metric.otel.exporter.items(metric_table) -->
| Name | Instrument Type | Unit (UCUM) | Description |
| -------- | --------------- | ----------- | -------------- |
| `otel.exporter.spans` | Counter | `{span}` | Measures the number of exported Spans. |
| `otel.exporter.items` | Counter | `{items}` | Measures the number of exported items (signal specific). |
<!-- endsemconv -->

<!-- semconv metric.otel.exporter.spans(full) -->
| Attribute | Type | Description | Examples | Requirement Level |
|---|---|---|---|---|
| `exporter.dropped` | boolean | Whether the Span was dropped or not. [1] | | Required |
| `exporter.type` | string | Type of exporter being used. | `OtlpGrpcSpanExporter` | Recommended |
<!-- semconv metric.otel.exporter.items(full) -->
| Attribute | Type | Description | Examples | Requirement Level |
|--------------------|---------|----------------------------------------------------------|----------------------------------------------------|-------------------|
| `exporter.domain` | string | Domain of the pipeline with this exporter | `sdk`, `collector` | Required |
| `exporter.name` | string | Type and optional name of this exporter. | `otlp/http`, `otlp/grpc` | Required |
| `exporter.signal` | string | Type of signal being described. | `traces`, `logs`, `metrics` | Required |
| `exporter.success` | boolean | Whether the item was successful or not. [1] | true, false | Recommended |
| `exporter.reason` | string | Short string explaining category of success and failure. | `ok`, `queue_full`, `timeout`, `permission_denied` | Detailed |

**[1]:** Spans may be dropped in case of failed ingestion, e.g. network problem or the exported endpoint being down.
**[1]:** Items may be dropped in case of failed ingestion, e.g. network problem or the exported endpoint being down. Consult transport-specific instrumentation for more information about the export requests themselves, including retry attempts.
<!-- endsemconv -->

[MetricRequired]: https://github.com/open-telemetry/opentelemetry-specification/blob/v1.22.0/specification/metrics/metric-requirement-level.md#required
39 changes: 1 addition & 38 deletions model/metrics/otel.yaml
@@ -1,38 +1 @@
groups:
- id: metric.otel.exporter.spans
type: metric
metric_name: otel.exporter.spans
brief: "Measures the number of exported Spans."
instrument: counter
unit: "{span}"
attributes:
- id: exporter.dropped
type: boolean
requirement_level: required
brief: "Whether the Span was dropped or not."
note: >
Spans may be dropped in case of failed ingestion, e.g. network problem
or the exported endpoint being down.
- id: exporter.type
type: string
requirement_level: recommended
brief: "Type of exporter being used."
examples: ["OtlpGrpcSpanExporter"]
- id: metric.otel.processor.spans
type: metric
metric_name: otel.processor.spans
brief: "Measures the number of processed Spans."
instrument: counter
unit: "{span}"
attributes:
- id: processor.dropped
type: boolean
requirement_level: required
brief: "Whether the Span was dropped or not."
note: >
Spans may be dropped if the internal buffer is full.
- id: processor.type
type: string
requirement_level: recommended
brief: "Type of processor being used."
examples: ["BatchSpanProcessor"]
DO NOT REVIEW