-
Notifications
You must be signed in to change notification settings - Fork 185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add processed/exported Span metrics. #184
Add processed/exported Span metrics. #184
Conversation
<!-- semconv metric.otel.exporter.spans(full) --> | ||
| Attribute | Type | Description | Examples | Requirement Level | | ||
|---|---|---|---|---| | ||
| `exporter.dropped` | boolean | Whether the Span was dropped or not. [1] | | Required | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are 3 states from the exporter's view:
- delivered (the exporter has received the confirmation from the ingestion)
- dropped (the exporter decided to drop the data knowing that the ingestion didn't accept it)
- unknown (the exporter has sent the data, but it has no idea whether the data is accepted or not by the ingestion)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, good one, yes. So I guess my initial approached took for granted this:
dropped=false
: Delivered, with confirmation from the ingestion step.dropped=true
: Either consecutive attempts were exhausted OR it's not confirmed ingestion was successful (including error on the server side).
Think it would make sense to have 3 values instead of two here (as suggested by your comment)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(On a related note: I will follow up this PR, once it's merged, but some additional metrics, such as retries count at the export level, etc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do think processor.dropped
and exporter.dropped
are quite generic in name.
- Are we trying to generalize a pipeline semantic where we can generically track items through a pipleine (like an ESB, something like Apache Camel, etc). or are these meant to be OTEL specific?
- Will these be the same between the OTEL Collector and the SDK? IF not, I think you may want a prefix denoting that.
<!-- semconv metric.otel.processor.spans(full) --> | ||
| Attribute | Type | Description | Examples | Requirement Level | | ||
|---|---|---|---|---| | ||
| `processor.dropped` | boolean | Whether the Span was dropped or not. [1] | | Required | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not included un-sampled spans, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, this doesn't include un-sampled Spans. Will clarify that in a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<!-- semconv metric.otel.exporter.spans(full) --> | ||
| Attribute | Type | Description | Examples | Requirement Level | | ||
|---|---|---|---|---| | ||
| `exporter.dropped` | boolean | Whether the Span was dropped or not. [1] | | Required | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm trying to understand how to use this counter and what "dropped' actually means.
Does dropped mean you never tried to export, or that you did try and gave up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes. It means that either the spans were dropped for an irrecoverable error (e.g. 4xx), or the max retries count has been reached. You are looking into this clarification or are thinking about some other scenarios as well? (Clarify this as well in this PR in a bit)
Hey @jsuereth thanks for the questions.
I'd personally go with these for now. It feels to me that defining semantic conventions that cover general values outside of OTel would require a lot of conversations previously.
So I was thinking they may be shared but after briefly checking, I'm a little more oriented towards making this SDK specific - this is because the Collector may have slightly different needs (labels/dimensions). |
|
||
**Status**: [Experimental][DocumentStatus] | ||
|
||
This document defines semantic conventions for OTel components (such as processors, exporters, etc). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OTel SDK components?
+1 for this work. I am a bit troubled by this being defined as a semantic convention. There's certainly an aspect of that in terms of how the metrics are named, but I also think this should be part of the SDK spec, as in all SDKs should be required to provide these metrics about themselves. Also, there are more to the SDK observability than just the exporter metrics. Here, for example, are the metrics that Jaeger SDKs used to expose: https://github.com/jaegertracing/jaeger-client-go/blob/master/metrics.go#L22-L106 |
It would be nice to have these metrics shared between SDKs and Collector. |
There needs to be a clear distinction between SDK-produced metrics and Collector-produced, because they can co-exist in the same process. I am not fond of the naming scheme being proposed here. I would rather see a hierarchy:
|
Are you envisioning Collector using Otel SDK and thus Collector processor will need to emit both sets of metrics? That makes sense, however do we need the metric names to be different? They can differ by an attribute. |
The use case we had in the Jaeger land where Jaeger backend components themselves were instrumented for tracing and would emit SDK-scoped metrics, which would then conflict with collector-specific metrics.
I don't know if we have this in the guidelines somewhere, but I consider attributes to be "optional" dimensions of the metrics that can be dropped (aggregated over) without destroying semantic meaning of the metric. E.g. tagging a metric with "region" is good, one can aggregate over it if by-region breakdown is not needed. But aggregating metrics from SDK and Collector into one time series does not make sense semantically (e.g. you would double-count spans if you did), so I would expect them to have different names, not just different attributes. |
I don't think we have a guideline like that. I am not sure such a guideline can be applicable to all and every use case. Here is an existing use case that would break if we were to adopt such a guideline. Collector processors currently emit metrics about the number of data points processed. There can be multiple processors chained sequentially. Each processor emits a measurement, with the value of the "processor" dimension set equal to the processor name. Aggregating over the "processor" dimension will indeed result in double-counting. It seems you are suggesting that a practice like that is invalid and we shouldn't be doing this in the Collector, but I am not sure what would be a reasonable alternate. Collector also has similar metrics for receiver and exporter components, where component name is a dimension. As a comparison Prometheus has the following guideline:
Of course the interpretation of "should be meaningful" is up for debate. I could argue that sum() is meaningful, it tells how many times datapoints were processed overall by different processors (including SDK and Collector). Not a very useful aggregation, but Prometheus doesn't require it to be useful, only meaningful. :-) If we change this to "drop" metric (which Collector also records) the sum() becomes quite useful: it gives the total number of dropped data points dropped by all processors, i.e. anywhere in the Collector. Here is another scenario that I think shows that a guideline "aggregating away a dimension should not result in double counting" is unnecessarily restrictive: Let's imagine a system where an incoming request requires participation of service A and service B, where service A calls service B for help. Both service A and service B emit a very standard "http.request.count" metric with "service.name" dimension. Do we consider it meaningless to aggregate away the "service.name" of "http.request.count" metric because it would mean double-counting of incoming request and thus this recording practice is invalid? What doe we expect service A and B to do instead? |
I like the Prometheus guideline. It's practical, and says almost the same thing as I am saying. It's not a strict rule, since there is no one-size-fits-all solution, it expects a judgement call of how generous we want to be with the interpretation of "meaningful". An even more generous interpretation is every time we bump a counter it's because of some event, so we can call the metric "events" and make everything else an attribute, and the sum() is the total count of events - meaningful, yet impractical. So we'd need some other criteria of what a practical "meaningful" is. I am not strongly opposed to having identical metrics from SDK and collector and only separating them by attribute (named how, btw?), but I am not seeing a strong argument why that would be a good idea either. A specific practical challenge we had in the past in Jaeger is that we used the prom-client as the metrics SDK, and it didn't allow redefining a metric with the same name twice in the same process (even with different attributes). So having two unrelated layers in the code emit the same metric required some special tricks, wasn't possible with off-the-shelf prom-client. Not saying it's a problem with the OTEL SDK, just shows that there was a little more to "meaningful" in the Prom's guideline. If we already have an established pattern of having different components emit identically named metrics, then we could continue with that pattern. It reminds me of another discussion about resource-identifying attributes vs. all metric attributes: here we designate some attributes as "semantics-changing". |
I've been diagnosing an OTel Collector that is instrumented with an SDK that exports through OTel Collector components arranged as an SDK exporter. The result is similar to what @yurishkuro described, I find myself confused because I'm getting metrics about dropped spans in two ways, both the collector pipeline's primary drops and the collector's instrumentation (secondary) drops. Although they are both literally "spans dropped", I would rather not try to make meaning from an aggregation of the two. I think it would be nice for us to define the semantics carefully for what we want to count, and then use separate but similar metric names for SDKs and collectors. The meaning of "Drop" has to be made very clear. Definitions for "Refused" and "Failed" and other adjectives we use are the important part here: we can re-use the definitions without re-using the same metric names. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to using the same metrics in SDKs and the collector. Would instrumentation scope be the right way to distinguish between collector and SDK usage of them? instrumentation scope should become a label in prometheus (to yuri's line of questioning)
| Attribute | Type | Description | Examples | Requirement Level | | ||
|---|---|---|---|---| | ||
| `exporter.dropped` | boolean | Whether the Span was dropped or not. [1] | | Required | | ||
| `exporter.type` | string | Type of exporter being used. | `OtlpGrpcSpanExporter` | Recommended | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this essentially the same as the instrumentation scope name?
| Attribute | Type | Description | Examples | Requirement Level | | ||
|---|---|---|---|---| | ||
| `exporter.dropped` | boolean | Whether the Span was dropped or not. [1] | | Required | | ||
| `exporter.type` | string | Type of exporter being used. | `OtlpGrpcSpanExporter` | Recommended | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would different instances of the same "exporter" get a different type, or the same type? E.g. if I have two OTLP grpc span exporters, would I be able to tell how many spans each was exporting? In the collector today, I believe we can tell the difference between two instances of the same component.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same type ;) I don't think we have a (clear) notion regarding differentiating two processors/exporters of the same type (at least in the SDK).
@yurishkuro I think for me this is a better argument against using the same metric. I don't know what the attribute name would be. @dashpole Using Scope name is a possibility but I am not sure what would be the values that we would use to distinguish between SDKs and Collector. |
@open-telemetry/collector-approvers @open-telemetry/collector-maintainers any thoughts on this? What are the current metrics we use in the Collector and do you think we could adopt in the Collector the conventions that this PR suggests? |
The spanmetric connector emits two metrics:
Both metrics can optionally have a namespace prepended via user confiugration. I think |
We don't seem to have a list of metrics emitted by the collector. We have metrics that are generated by the collector facilities ( I would absolutely welcome aligning the names/prefixes with the SDKs and I believe the ones @carlosalberto specifically proposed would be relatively easy to adopt. For reference, these are the ones I have in mind: |
@jpkrohling Thank you for highlighting what is, I think, a source of confusion about the collectors metrics. For a given unit of data (span, metric data point, log record), how many distinct standard outcomes are there? The exporterhelper generates at 1 and the obsreport genrates 2, I think:
Note that when an export is retried because configuration allows it, the items should not be counted multiple times, an in-flight export is not counted until its retries have been exhausted. |
We haven't been very prescriptive about scope names, but we could be for these special cases. For example, I prefer these names to be short so I might suggest scope name "open-telemetry/collector" (w/ collector release version) and "open-telemetry/sdk" (w/ SDK release version--which language identifiable via resource). |
The list of three metrics above (#184 (comment)) is, I think, incomplete when we consider how to handle
|
Hey @jmacd
There's the case of different exporters, e.g. "OtlpGrpcExporter", "OtlpHttpExporter", "OtlpHttpMyFeatureExporter", etc.
IIUC, you want to count the value from PartialSuccess as "refused", more than failed/dropped, right? In theory adding the dimension to something like |
@carlosalberto after reviewing the Collector and trying to make its metrics consistent with your proposal here, the result I came to is here: carlosalberto#1 This proposes one metric per component, with three levels of detail to cover basic, normal, and detailed use-cases. |
This was presented in today's Specification SIG, and while some discussion already began over the PR mentioned above (#184 (comment)), the agreement today was that I should open a new PR in this repository with a new PR history, then present it at next meetings of Collector SIG and Sem-Conv SIG. |
<!-- semconv metric.otel.processor.spans(full) --> | ||
| Attribute | Type | Description | Examples | Requirement Level | | ||
|---|---|---|---|---| | ||
| `processor.dropped` | boolean | Whether the Span was dropped or not. [1] | | Required | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this missing the otel.
prefix? So otel.processor.dropped
instead of processor.dropped
?
Closing in favor of #598 |
Initial stab at adding metrics for processed/exported/dropped Spans. Made this metrics Span-specific, instead of specifying the signal type (metrics, logs), as not all signals may have the same semantics.
Any feedback will be greatly appreciated - although I'm specially interested in:
dropped
label - I think we could massage it to make it a common one, maybespans.dropped
? @jsuereth may have an opinion here.exporter.type
andprocessor.type
- is there any value in sharing this as well? Maybeotel.component
?Fixes part of #83