-
Notifications
You must be signed in to change notification settings - Fork 897
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Metrics: Requirements for safe attribute removal #1297
Comments
@hardproblems responding to open-telemetry/opentelemetry-collector-contrib#12300 (comment) and open-telemetry/opentelemetry-collector-contrib#12300 (comment)
I understand the concern. There are definitely these use-cases. The single-writer principle and the definition for overlapping streams were created to prevent problems at the consumer when sender's do not use unique IDs. Although I speculate that the problem with OpenTelemetry's metric data model discusses the requirements for safe removal but does not exactly spell out what the collector should do when it removes uniquely-identifying attributes. This issue was created, and I once drafted a PR #1649 which was premature at the time. The reason this is a corner case and not a earler problem in the OTel specification, is that it only impacts the collector. When an SDK is required to remove attributes (as it must to implement Views), it has many advantages over the collector. Because the SDK is a single-writer (by definition), it can easily combine (i.e., aggregate) multiple series to "safely" remove attributes--this is easy because its data is perfectly timestamp-aligned. When the collector sees data from multiple senders and wants to erase identifying attributes, it is required to perform the necessary aggregation. This requires some amount of batching and interpolating and is quite a sophisticated bit of work, though I would argue significantly less work than implementing "full-feature" recording rules. |
@jmacd is this still a priority and something you're willing to work on? |
What are you trying to achieve?
Make it safe to remove labels from OTLP metrics in a way that
preserves their semantics. We have identified two scenarios where it
is not safe to simply strip labels away from metrics data.
CUMULATIVE metric, the resulting timeseries has overlapping time
intervals (i.e., their
StartTimeUnixNanos
andTimeUnixNanos
overlap).
SumObserver
andUpDownSumObserver
data points are strippedof labels that were used to subdivide an observed sum, the resulting
points lose meaning.
An example for both is provided below.
CUMULATIVE points and the loss of unique labels
For the first item, OTLP has not specified how to interpret CUMULATIVE
data in this scenario. For reference, current OTLP Metrics Sum data
point defines
AggregationTemporality
as follows:and:
Simply translating overlapping CUMULATIVE data into Prometheus results in incorrect interpretation
because timeseries are not meant to have overlapping points (e.g.,
open-telemetry/opentelemetry-collector#2216).
For this reason, we should specify that CUMULATIVE timeseries MUST
always retain at least one uniquely identifying resource attribute.
This point suggests that
service.instance.id
should be a requiredresource attribute.: #1034
We should also specify how consumers are expected to handle the
situation where, because of a mis-configuration, data is presented with some
degree of overlap. The source of this condition is sometimes unsafe label removal but also results from a so-called "zombie" process, the case where multiple writers unintentionally produce
overlapping points.
This issue proposes that Metrics systems SHOULD ensure that
overlapping points are considered duplicates, not valid points. When
overlapping points are stored and queried, the system SHOULD ensure
that only one data point is taken. We might specify that metrics systems
SHOULD apply a heuristic that prefers data points with the youngest
StartTimeUnixNanos
under this condition, also that they SHOULD warnthe user about zombies or potentially unsafe label removal.
(UpDown)SumObserver points and subdivided sums
As an example of this scenario, consider observations being made
from a SumObserver callback:
process.cpu.usage{state=idle}
process.cpu.usage{state=user}
process.cpu.usage{state=system}
Now consider that we want to subdivide the process CPU usage by CPU
core number, making many more series:
process.cpu.usage{state=idle,cpu=0}
process.cpu.usage{state=idle,cpu=1}
process.cpu.usage{state=idle,cpu=...}
process.cpu.usage{state=user,cpu=0}
process.cpu.usage{state=user,cpu=1}
process.cpu.usage{state=user,cpu=...}
process.cpu.usage{state=system,cpu=0}
process.cpu.usage{state=system,cpu=1}
process.cpu.usage{state=system,cpu=...}
If these are were expressed as DELTA points (which requires
remembering the last value and performing subtraction), we can
correctly erase the
cpu
label requires by simple erasure. If theseare expressed as CUMULATIVE points, to erase the
cpu
label meansperforming summation, somehow.
Proposed collector algorithm
I'll propose an algorithm that I think we can use in the
OTel-Collector to correctly erase labels in both of these cases. It
requires introducing a delay over which aggregation takes place for
both DELTA and CUMULATIVE points.
The OTel-Go model SDK already implements a correct label "reducer"
Processor, but the job of the SDK is much simpler than the Collector's
job in this case, for two reasons: (1) the SDK never resets, (2) the
collection is naturally aligned (i.e., all SumObservers have the same
TimeUnixNanos
).Step 1:
S
.StartTime=S
Step 2: windowing / reducing
Step 3: spatial aggregation
StartTime
labelThis mimics the behavior of the OTel-Go SDK by introducing a temporary
StartTime
label to keep distinct timeseries separate until they aresummed. I will follow up with a more thorough explanation of this algorithm.
The text was updated successfully, but these errors were encountered: